Section 1


1.  Introduction 

The design of a distributed database management system (DBMS)
involves many critical design decisions. One of the most important of
these is the choice of the concurrency control algorithm. Many
concurrency control algorithms for distributed DBMSs have been proposed
[BERN81a], but few studies have been undertaken to rigorously compare
their performance [LIN81, LINN82a, LINN82b, LINN82c, GARC78, GARC79a,
GALL82, RIES79a, RIES79b] and other characteristics. One possible
reason for this is that, in detail, these algorithms seem very
different, thus making comparison difficult. As a result, the
distributed DBMS designer finds it difficult to choose the concurrency
control algorithm that is appropriate given the design parameters of
the particular system under consideration.

This report attempts to provide a handbook of information about a
number of important concurrency control algorithms which can be used in
the design of a distributed DBMS. The report describes a framework for
distributed DBMS concurrency control which abstracts the essential
structure of these algorithms from algorithmic details, and classifies
algorithms within this framework. The report then summarizes the
results of a detailed simulation study of the performance of these
algorithms based on the framework. For various system and application
environments, algorithms are ranked according to their performance.
These rankings of algorithms can guide the system designer in selecting
the best distributed DBMS concurrency algorithm for his system.
Additional details of the simulation results can be found in an
Appendix, while full details of the simulation results can be found in
associated semi-annual and final technical reports [LIN81a, LIN82a,
LIN82b, LIN83].

In using the results, the system designer must interpret the ranking
of the algorithms in the context of the performance evaluation model
used in the simulation. The model either does not simulate or makes
assumptions about some details of the algorithms. This is unavoidable
in any simulation. However, the model used here captures all the
important factors that affect the performance of a distributed
concurrency control algorithm: I/O delay, communication delay, CPU
delay, transaction blocking through locking, transaction abortion due
to conflict or deadlock, and overhead for deadlock detection. Thus, the
model is general enough to apply to most cases.

This report also provides a basis for the system designer to evaluate
different database recovery algorithms. Like database concurrency
control, database recovery has also been studied extensively, many
algorithms have been proposed, and, on the surface, the algorithms seem
very different. However, a careful examination shows that many of these
algorithms are quite similar. This report describes a framework for
database recovery algorithms. Within this framework, the many database
recovery algorithms presented in the literature have been reduced to
four categories. This framework can be used as the basis to compare the
algorithms. Algorithms belonging to the same category may differ only
in minor details.

The report is organized as follows. Section 2 describes the framework
for distributed DBMS concurrency control algorithms, and Section 3
describes the framework for database recovery algorithms. Section 4
compares the performance of various distributed DBMS concurrency control
algorithms, using the framework developed in Section 2. Section 5
contains a list of references.


2. A Framework For Distributed Database Concurrency Control


2.1  Introduction 

A distributed database system (DDBS) is a database system (DBS)
that provides commands to read and write data that is stored at multiple
sites of a network. If users access a DDBS concurrently, they may
interfere with each other by attempting to read and/or write the same
data. Concurrency control is the activity of preventing such
interference.

Dozens of algorithms that solve the DDBS concurrency control problem
have been published (see [BERN82] and the references). Unfortunately,
many of these algorithms are so complex that only an expert can
understand them.

To remedy this situation, we develop, in this section, a simple
framework for understanding concurrency control algorithms. The
framework decomposes the problem into subproblems and gives basic
techniques for solving each subproblem. To understand a published
algorithm, one first identifies the technique used for each subproblem
and then checks to see whether the techniques have been appropriately
combined. The framework can also be used to develop new algorithms by
combining existing techniques in new ways.

This section has eight subsections. Sections 2.2 and 2.3 set the
stage by describing a simple DDBS architecture and sketching the
framework in terms of the architecture. The framework itself appears in
Sections 2.4 through 2.8. Section 2.9 uses the framework to explain
several published algorithms. Section 2.10 presents a summary.


2.2  Distributed  DBS  Architecture

We  use  a  simple  model  of  DDBS  structure  and  behavior.  The  model 
highlights  those  aspects  of  a  DDBS  that  are  important  for  understanding 
concurrency  control,  while  hiding  details  that  don't  affect  concurrency 
control. 

A database consists of a set of data items, denoted {...,x,y,z}.
In practice, a data item can be a file, record, page, etc. But for the
purposes of this paper, it's best to think of a data item as a simple
variable. For now, assume each data item is stored at exactly one site.

Users access data items by issuing Read and Write operations.
Read(x) returns the current value of x. Write(x, new-value) updates the
current value of x to new-value.

Users interact with the DBMS by executing programs called
transactions. A transaction only interacts with the outside world by
issuing Reads and Writes to the DDBS or by doing terminal I/O. We assume
that every transaction is a complete and correct computation: each
transaction, if executed alone on an initially consistent database,
would terminate, produce correct results, and leave the database
consistent.

Each site of a DDBS runs one or more of the following software
modules (see Figures 2.1 and 2.2): a transaction manager (TM), a data
manager (DM), or a scheduler. Transactions talk to TMs; TMs talk to
schedulers; schedulers talk among themselves and also talk to DMs; DMs
manage the data.

Each transaction also issues a Begin operation to its TM when it
starts executing and an End when it's finished.

The TM forwards each Read and Write to a scheduler. (Which
scheduler depends on the concurrency control algorithm; usually the
scheduler is at the same site as the data being read or written. In
some algorithms, Begins are also sent to schedulers.)

Figure 2.1 DDBS Architecture

Figure 2.2 Processing Operations

The scheduler controls the order in which DMs process Reads and
Writes. When a scheduler receives a Read or Write operation, it can
either output the operation right away (usually to a DM, sometimes to
another scheduler), delay the operation by holding it for later action,
or reject the operation. A rejection causes the system to abort the
transaction that issued the operation: every Write processed on behalf
of the transaction is undone (restoring the old value of the data item),
and every transaction that read a value written by the aborted
transaction is also aborted. This phenomenon of one abort triggering
other aborts is called cascading aborts. (It is usually avoided in
commercial DBSs by not allowing a transaction to read another
transaction's output until the DBS is certain that the latter
transaction will not abort. In this report, we will not try to prevent
cascading aborts.) Techniques for implementing abort will be discussed
in Section 3. (See [GRAY81, HAMM80, LAMP76].)
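The cascading-abort rule above can be illustrated with a minimal sketch. It assumes the system records, for each transaction, which transactions it has read from; the `reads_from` mapping and the function name are hypothetical bookkeeping introduced here, not part of the report.

```python
from collections import defaultdict

def cascading_aborts(reads_from, first_aborted):
    """Return every transaction that must abort once `first_aborted` does.

    reads_from maps a transaction to the set of transactions whose written
    values it has read (hypothetical bookkeeping, not from the report).
    """
    # Invert the relation: writer -> transactions that read its output.
    readers = defaultdict(set)
    for reader, writers in reads_from.items():
        for writer in writers:
            readers[writer].add(reader)

    aborted, stack = {first_aborted}, [first_aborted]
    while stack:
        t = stack.pop()
        for r in readers[t] - aborted:
            aborted.add(r)
            stack.append(r)
    return aborted

# T2 read from T1, T3 read from T2, T4 read only from T0.
deps = {"T2": {"T1"}, "T3": {"T2"}, "T4": {"T0"}}
doomed = cascading_aborts(deps, "T1")
```

Here aborting T1 drags T2 and T3 down with it, while T4 is unaffected because it never read a value that T1 wrote, directly or indirectly.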

The  DM  executes  each  Read  and  Write  it  receives.  For  Read,  the  DM 
looks  in  its  local  database  and  returns  the  requested  value.  For  Write, 
the  DM  modifies  its  local  database  and  returns  an  acknowledgment.  The 
DM  sends  the  returned  value  or  acknowledgment  to  the  scheduler,  which 
relays  it  back  to  the  TM,  which  relays  it  back  to  the  transaction. 

DMs do not necessarily execute operations 'first come, first
served'. If a DM receives a Read(x) and a Write(x) at about the same
time, the DM is free to execute these operations in either order. If
the order matters (as it probably does in this case) it is the
scheduler's responsibility to enforce the order. This is done by using
a handshaking communication discipline between schedulers and DMs (see
Figure 2.3). If the scheduler wants Read(x) to be executed before
Write(x), it sends Read(x) to the DM, waits for the DM's response, and
then sends Write(x). Thus the scheduler doesn't even send Write(x) to
the DM until it knows Read(x) was executed. Of course, when the
execution order doesn't matter, the scheduler can send operations
without the handshake.
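As a rough illustration of the handshake, the sketch below models the DM's response as the return value of a synchronous call, so the scheduler cannot send the next operation until the previous one has been answered. The class and method names are assumptions made for this example; a real DDBS would exchange messages over a network.

```python
class DataManager:
    """Toy DM: executes an operation and returns a value or acknowledgment."""
    def __init__(self):
        self.db = {}
        self.executed = []          # order in which operations actually ran

    def execute(self, op, item, value=None):
        self.executed.append((op, item))
        if op == "read":
            return self.db.get(item)
        self.db[item] = value
        return "ack"

class Scheduler:
    def __init__(self, dm):
        self.dm = dm

    def send_in_order(self, ops):
        """Handshake: send the next operation only after the DM has
        responded to the previous one."""
        responses = []
        for op, item, value in ops:
            # The call returns only once the DM has executed the operation.
            responses.append(self.dm.execute(op, item, value))
        return responses

dm = DataManager()
dm.db["x"] = 1
out = Scheduler(dm).send_in_order([("read", "x", None), ("write", "x", 2)])
```

Because the Read's response arrives before the Write is sent, the Read observes the old value of x.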

Handshaking is also used between other modules when execution order
is important.

Figure 2.3 Handshaking (to execute Read(x) on behalf of transaction 1
followed by Write(x) on behalf of transaction 2)


2.3  The  Framework 

The DDBS modules that are most important to concurrency control are
schedulers. A concurrency control algorithm consists of some number of
schedulers that run some type of scheduling algorithm in a centralized
or distributed fashion. In addition, the concurrency control algorithm
must handle 'replicated data'. TMs often handle this problem.

To understand a concurrency control algorithm using our framework
one must determine:

1. The type of scheduling algorithm used (discussed in Sections 2.4 and
2.7)

2. The location of the scheduler(s) (i.e., centralized vs. distributed
(Section 2.5))

3. How replicated data is handled (Section 2.6)


2.4  Schedulers 

There are four types of schedulers: two-phase locking, timestamp
ordering, serialization graph checking, and certifiers. Each type can
be used to schedule rw conflicts, ww conflicts, or both. This section
describes each type of scheduler and assumes that it is used for both
kinds of conflict. Ways of combining scheduler types (e.g., two-phase
locking for rw conflicts and timestamp ordering for ww conflicts) are
described in Section 2.8. This section also assumes that the scheduler
runs at a single site (see Figure 2.4). Section 2.5 lifts this
restriction.


2.4.1  Two-Phase  Locking 

A two-phase locking (2PL) scheduler is defined by three rules (EGLT):

1. Before outputting r_i[x] (resp. w_i[x]), set a read-lock (resp.
write-lock) for T_i on x. The lock must be held (at least) until the
operation is executed by the appropriate DM. (Handshaking can be
used to guarantee that locks are held long enough.)

2. Different transactions cannot simultaneously hold 'conflicting'
locks. Two locks conflict if they are on the same data item and (at
least) one is a write-lock. If rw and ww scheduling is done
separately, the definition of 'conflict' is modified. For rw
scheduling, two locks on the same data item conflict if exactly one
is a write-lock (i.e., write-locks don't conflict with each other).
For ww scheduling, both locks must be write-locks.

3. After releasing a lock, a transaction cannot obtain any more locks.

Figure 2.4 DDBS Architecture with Centralized Scheduler

Rule (3.) causes locks to be obtained in a two-phase manner. During
its growing phase, a transaction obtains locks without releasing any.
By releasing a lock, the transaction enters its shrinking phase, during
which it can only release locks. Rule (3.) is usually implemented by
holding all of a transaction's locks until it terminates.
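The three 2PL rules can be sketched as a small lock table. This is a simplified, single-site illustration under stated assumptions: the class, its method names, and the 'granted'/'blocked' return values are invented for the example, and blocked requests are simply reported rather than queued.

```python
class TwoPhaseLockTable:
    def __init__(self):
        self.locks = {}         # item -> list of (txn, mode), mode in {"r", "w"}
        self.shrinking = set()  # transactions that have released a lock

    def acquire(self, txn, item, mode):
        """Return 'granted' or 'blocked' (Rule 2); raise on a Rule 3 violation."""
        if txn in self.shrinking:
            raise RuntimeError(f"{txn} is in its shrinking phase (Rule 3)")
        holders = self.locks.setdefault(item, [])
        for other, other_mode in holders:
            # Locks conflict if held by different transactions on the same
            # item and at least one of them is a write-lock.
            if other != txn and "w" in (mode, other_mode):
                return "blocked"
        holders.append((txn, mode))
        return "granted"

    def release_all(self, txn):
        """Usual implementation of Rule 3: drop every lock at termination."""
        self.shrinking.add(txn)
        for item in self.locks:
            self.locks[item] = [(t, m) for t, m in self.locks[item] if t != txn]

table = TwoPhaseLockTable()
results = [
    table.acquire("T1", "x", "r"),
    table.acquire("T2", "x", "r"),   # read-locks are compatible
    table.acquire("T2", "x", "w"),   # conflicts with T1's read-lock
]
table.release_all("T1")
after_release = table.acquire("T2", "x", "w")
```

Releasing all locks at once, as `release_all` does, is the common way to satisfy Rule (3.) that the text mentions; a transaction that tries to lock again afterwards is refused.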

Due to Rule (2.), an operation received by a scheduler may be
delayed because another transaction already owns a conflicting lock.
Such blocking situations can lead to deadlock. For example, suppose
r_1[x] and r_2[y] set read-locks, and then the scheduler receives w_1[y]
and w_2[x]. The scheduler cannot set the write-lock needed by w_1[y]
because T_2 holds a read-lock on y. Nor can it set the write-lock for
w_2[x] because T_1 holds a read-lock on x. And neither T_1 nor T_2 can
release its read-lock before getting the needed write-lock because of
Rule (3.). Hence, we have a deadlock: T_1 is waiting for T_2, which is
waiting for T_1.

Deadlocks can be characterized by a waits-for graph [HOLT72,
KING74], a directed graph whose nodes represent transactions and whose
edges represent waiting relationships. Edge T_i -> T_j means T_i is
waiting for a lock owned by T_j. A deadlock exists if and only if (iff)
the waits-for graph has a cycle. For example, in the above example the
waits-for graph is

    T_1 -> T_2 -> T_1

A popular way of handling deadlock is to maintain the waits-for
graph and to periodically search it for cycles. (See [Chap. 5, AHO75]
for cycle detection algorithms.) When a deadlock is detected, one of
the transactions on the cycle is aborted and restarted, thereby breaking
the deadlock.
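A periodic deadlock check of this kind might look like the following sketch, which runs a depth-first search over the waits-for graph. The graph representation and the choice of the first transaction on the cycle as victim are assumptions for illustration; the report does not prescribe a victim-selection policy.

```python
def find_deadlocked(waits_for):
    """Return one cycle in the waits-for graph as a list of transactions,
    or None. waits_for maps each transaction to the set it is waiting for."""
    WHITE, GREY, BLACK = 0, 1, 2
    colour = {}
    path = []

    def visit(t):
        colour[t] = GREY
        path.append(t)
        for u in sorted(waits_for.get(t, ())):
            if colour.get(u, WHITE) == GREY:      # back edge closes a cycle
                return path[path.index(u):]
            if colour.get(u, WHITE) == WHITE:
                cycle = visit(u)
                if cycle:
                    return cycle
        colour[t] = BLACK
        path.pop()
        return None

    for t in sorted(waits_for):
        if colour.get(t, WHITE) == WHITE:
            cycle = visit(t)
            if cycle:
                return cycle
    return None

# T1 and T2 wait for each other; T3 waits for T1.
graph = {"T1": {"T2"}, "T2": {"T1"}, "T3": {"T1"}}
cycle = find_deadlocked(graph)

# Break the deadlock by aborting one transaction on the cycle.
victim = cycle[0]
del graph[victim]
for waiting in graph.values():
    waiting.discard(victim)
```

After the victim is removed, the remaining graph is acyclic and the surviving transactions can proceed.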

2.4.2  Timestamp  Ordering 

In timestamp ordering (T/O) each transaction is assigned a globally
unique timestamp by its TM. (See [BERN82, THOM79] for how this is
done.) The TM attaches the timestamp to all operations issued by a
transaction. A T/O scheduler is defined by a single rule: Output all
pairs of conflicting operations in timestamp order. Make sure
conflicting operations are executed by DMs in the order they were
output. (Handshaking can be used to make sure of this.) As for 2PL, the
definition of 'conflicting operation' is modified if rw and ww
scheduling are done separately.

Several varieties of T/O schedulers have been proposed. We only
sketch these variations here. Full details appear in [BERN82].

A basic T/O scheduler outputs operations in essentially first come,
first served order, as long as the T/O scheduling rule holds. When the
scheduler receives r_i[x] it does the following:

    if TS(i) < largest timestamp of any Write on x yet 'accepted'
    then reject r_i[x]
    else 'accept' r_i[x] and output it as soon as all Writes on x with
         smaller timestamp have been acknowledged by the DM.

When the scheduler receives w_i[x] it behaves as follows:

    if TS(i) < largest timestamp of any Read or Write on x yet
       'accepted'
    then reject w_i[x]
    else 'accept' w_i[x] and output it as soon as all Reads and Writes
         on x with smaller timestamp have been acknowledged by the DM.

A conservative T/O scheduler avoids rejections by delaying operations
instead. An operation is delayed until the scheduler is sure that
outputting it will cause no future operations to be rejected.
Conservative T/O requires that each scheduler receive Reads and Writes
from each TM in timestamp order. To output any operation, the scheduler
must have an operation from each TM in its 'input queue'. The scheduler
then 'accepts' the operation that has the smallest timestamp. 'Accept'
means to remove the operation from the input queue and to output it as
soon as all conflicting operations that have smaller timestamp have
been acknowledged by the DM. Variations on conservative T/O are
discussed in [BERN82, BERN80a, LIN79].
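The input-queue discipline of conservative T/O can be sketched as follows. This is a simplified model that only shows the 'accept' decision: it assumes each TM already sends in timestamp order and omits the wait for DM acknowledgments. The class and method names are hypothetical.

```python
from collections import deque

class ConservativeTO:
    def __init__(self, tms):
        self.queues = {tm: deque() for tm in tms}   # one input queue per TM

    def receive(self, tm, ts, op):
        # Each TM is assumed to send its operations in timestamp order.
        self.queues[tm].append((ts, op))

    def accept_ready(self):
        """Accept operations while every TM has one queued, always taking
        the operation with the smallest timestamp."""
        accepted = []
        while all(self.queues.values()):   # an empty deque is falsy
            tm = min(self.queues, key=lambda m: self.queues[m][0][0])
            accepted.append(self.queues[tm].popleft())
        return accepted

s = ConservativeTO(["TM1", "TM2"])
s.receive("TM1", 5, "r[x]")
first = s.accept_ready()      # TM2's queue is empty, so nothing is accepted
s.receive("TM2", 3, "w[x]")
s.receive("TM2", 8, "w[y]")
second = s.accept_ready()
```

The operation with timestamp 8 stays queued at the end: once TM1's queue is empty, the scheduler cannot rule out a later TM1 operation with a smaller timestamp. This is the source of the delays the text attributes to conservative T/O.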

Basic T/O and conservative T/O are endpoints of a spectrum. Basic
T/O delays operations very little, but it tends to reject many
operations. Conservative T/O never rejects operations, but it tends to
delay them often. One can imagine T/O schedulers between these
extremes. To our knowledge, no one has yet proposed such a scheduler.

Thomas' write rule (TWR) is a technique that reduces delay and
rejection [THOM79]. TWR can be used only to schedule Writes, and it
needs to be combined with basic or conservative T/O to yield a complete
scheduler. If we're interested only in ww scheduling, TWR is simple.
When the scheduler receives w_i[x] it does the following:

    if TS(i) < largest timestamp of any Write on x yet 'accepted'
    then 'pretend' to execute w_i[x] (i.e., send an acknowledgement
         back to the TM, but don't send the Write to the DM)
    else 'accept' w_i[x] and process it as usual.

The basic T/O-TWR combination works like this. Reads are processed
exactly as in basic T/O. But when the scheduler receives a w_i[x],
it combines the basic T/O rule with TWR as follows:

    if TS(i) < largest timestamp of any Read on x yet 'accepted'
       (rw scheduling: basic T/O)
    then 'reject' w_i[x]
    else if TS(i) < largest timestamp of any Write on x yet 'accepted'
       (ww scheduling: TWR)
    then 'pretend' to execute w_i[x]
    else 'accept' w_i[x] and output it as soon as all operations on x
         with smaller timestamp have been acknowledged by the DM.

The conservative T/O-TWR combination is described in [BERN82].
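The basic T/O-TWR decision rules can be written out as a short sketch. It covers only the accept/reject/pretend decision and assumes non-negative timestamps; waiting for DM acknowledgments before output is left out, and the class and method names are invented for the example.

```python
class BasicTOWithTWR:
    def __init__(self):
        self.max_read = {}    # item -> largest timestamp of any accepted Read
        self.max_write = {}   # item -> largest timestamp of any accepted Write

    def read(self, ts, item):
        # Reads are handled exactly as in basic T/O.
        if ts < self.max_write.get(item, -1):
            return "reject"
        self.max_read[item] = max(ts, self.max_read.get(item, -1))
        return "accept"

    def write(self, ts, item):
        if ts < self.max_read.get(item, -1):     # rw scheduling (basic T/O)
            return "reject"
        if ts < self.max_write.get(item, -1):    # ww scheduling (TWR)
            return "pretend"                     # acknowledge, skip the DM
        self.max_write[item] = ts
        return "accept"

s = BasicTOWithTWR()
trace = [
    s.write(10, "x"),   # first Write on x
    s.read(5, "x"),     # too old: a Write with timestamp 10 was accepted
    s.read(12, "x"),
    s.write(11, "x"),   # too old: a Read with timestamp 12 was accepted
    s.write(20, "y"),
    s.write(15, "y"),   # obsolete Write, ignored under TWR
]
```

The last call shows the benefit of TWR: under basic T/O alone the late Write on y would be rejected and its transaction aborted, whereas here it is acknowledged and dropped.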


2.4.3  Serialization  Graph  Checking 

A serialization graph (SG) is a directed graph whose nodes are
transactions, such as T_0, ..., T_k, and whose edges are all T_i -> T_j
such that, for some x, either (1.) T_i reads x before T_j writes x, or
(2.) T_i writes x before T_j reads x, or (3.) T_i writes x before T_j
writes x. A serialization graph checking scheduler works by explicitly
building a serialization graph (SG) and checking it for cycles. Like
basic T/O, an SG checking scheduler never delays an operation (except
for handshaking reasons). Rejection is the only action used to avoid
incorrect execution.

An SG checking scheduler is defined by the following rules.

1. When transaction T_i Begins, add node T_i to SG.

2. When a Read or Write from T_i is received, add all edges T_j -> T_i
such that T_j is a node of SG, and the scheduler has already output a
conflicting operation from T_j. As for the previous schedulers, the
definition of 'conflicting operation' is modified if rw and ww
conflicts are scheduled separately.

3. If after Rule 2 SG is still acyclic, output the operation. Make
sure that conflicting operations are executed by DMs in the order
they were output. (Handshaking can be used for this.)

4. If after Rule 2 SG has become cyclic, reject the operation. Delete
node T_i and all edges T_i -> T_j or T_j -> T_i from SG. (SG is now
acyclic again.)
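The four rules can be sketched as a small single-site checker. Several details here are assumptions rather than part of the report: operations are `(txn, kind, item)` tuples, a rejected transaction's earlier operations are dropped from the history, and terminated transactions are never pruned from the graph.

```python
class SGChecker:
    def __init__(self):
        self.edges = {}      # node -> set of successor nodes
        self.output = []     # (txn, kind, item) operations already output

    def begin(self, txn):                       # Rule 1
        self.edges.setdefault(txn, set())

    def _reaches(self, src, dst):
        stack, seen = [src], set()
        while stack:
            n = stack.pop()
            if n == dst:
                return True
            if n not in seen:
                seen.add(n)
                stack.extend(self.edges.get(n, ()))
        return False

    def submit(self, txn, kind, item):
        # Rule 2: add T_j -> T_i for each conflicting operation already output.
        preds = {t for t, k, x in self.output
                 if t != txn and t in self.edges and x == item
                 and "w" in (k, kind)}
        for p in preds:
            self.edges[p].add(txn)
        # Every new edge enters txn, so a new cycle exists exactly when
        # txn can reach one of its predecessors.
        if any(self._reaches(txn, p) for p in preds):   # Rule 4
            self.edges.pop(txn, None)
            for succ in self.edges.values():
                succ.discard(txn)
            self.output = [o for o in self.output if o[0] != txn]
            return "reject"
        self.output.append((txn, kind, item))           # Rule 3
        return "output"

sg = SGChecker()
sg.begin("T1")
sg.begin("T2")
trace = [
    sg.submit("T1", "r", "x"),
    sg.submit("T2", "w", "x"),   # adds T1 -> T2
    sg.submit("T2", "w", "y"),
    sg.submit("T1", "r", "y"),   # would add T2 -> T1, closing a cycle
]
```

The final Read is rejected because it would make T1 both precede and follow T2; deleting T1's node leaves the graph acyclic again.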

One technical problem with SG checkers is that a transaction must
remain in SG even after it has terminated. A transaction can be deleted
from SG only when it is a source node of the graph (i.e., when it has no
incoming edges). See [CASA79] for a discussion of this problem and for
techniques that efficiently encode information about terminated
transactions that remain in SG.


2.4.4  Certifiers

The term certifier refers to a scheduling philosophy, not a
specific scheduling rule. A certifier is a scheduler that makes its
decisions on a per-transaction basis. When a certifier receives an
operation, it internally stores information about the operation and
outputs it as soon as all earlier conflicting operations have been
acknowledged. When a transaction ends, its TM sends the End operation
to the certifier. At this point, the certifier checks its stored
information to see whether the transaction executed serializably. If it
did, the certifier certifies the transaction, allowing it to terminate;
otherwise, the certifier aborts the transaction.

All of the earlier schedulers can be adapted to work as certifiers.
Here is an SG checking certifier. When a certifier receives an
operation, it adds a node and some edges to SG as explained in the
previous section. The certifier does not check for cycles at this time.
When a transaction, T_i, ends, the certifier checks SG for cycles. If
T_i does not lie on a cycle, it is certified; otherwise it is aborted.

Here is a 2PL certifier [THOM79, KUNG79]. Define a transaction to
be active from the time the certifier receives its first operation until
the certifier processes its End. The certifier stores two sets for each
active transaction T_i:

    T_i's readset,  RS(i) = {x | the certifier has output r_i[x]}
    T_i's writeset, WS(i) = {x | the certifier has output w_i[x]}

The certifier updates these sets as it receives operations. When the
certifier receives End_i, it runs the following test:

    Let RS(active) = ∪ (RS(j), such that T_j is active, but j ≠ i)
        WS(active) = ∪ (WS(j), such that T_j is active, but j ≠ i)

    if RS(i) ∩ WS(active) ≠ ∅, or
       WS(i) ∩ (RS(active) ∪ WS(active)) ≠ ∅
    then abort T_i
    else certify T_i.

This amounts to pretending that transactions hold imaginary locks
on their readsets and writesets. When transaction T_i ends, the
certifier sees whether T_i's imaginary locks conflict with the imaginary
locks held by other active transactions. If there is no conflict, T_i
is certified. Otherwise it is aborted.
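The 2PL certifier's test translates almost directly into set operations. The sketch below is a minimal single-site version: a transaction becomes active on its first operation and stops being active when its End is processed. The class and method names are assumptions made for the example.

```python
class TwoPLCertifier:
    def __init__(self):
        self.rs = {}   # active txn -> readset RS(i)
        self.ws = {}   # active txn -> writeset WS(i)

    def operation(self, txn, kind, item):
        self.rs.setdefault(txn, set())
        self.ws.setdefault(txn, set())
        (self.rs if kind == "r" else self.ws)[txn].add(item)

    def end(self, txn):
        """Certify txn unless its imaginary locks conflict with those of
        another active transaction."""
        # Removing txn first means the unions below range over j != i.
        rs_i = self.rs.pop(txn, set())
        ws_i = self.ws.pop(txn, set())
        rs_active = set().union(*self.rs.values())
        ws_active = set().union(*self.ws.values())
        conflict = (rs_i & ws_active) or (ws_i & (rs_active | ws_active))
        return "abort" if conflict else "certify"

c = TwoPLCertifier()
c.operation("T1", "r", "x")
c.operation("T2", "w", "x")
c.operation("T3", "r", "z")
verdicts = [c.end("T1"), c.end("T2"), c.end("T3")]
```

T1 is aborted because its imaginary read-lock on x conflicts with T2's imaginary write-lock. Once T1 is no longer active, T2 and T3 touch disjoint items and are both certified.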

T/O certifiers are also possible. To our knowledge, no one has
proposed this algorithm yet. Certifiers also can be built that will
check for serializable executions during transactions' executions, not
just at the end. The extreme version of this idea is to check for
serializability on every operation. At this extreme, the certifier
reduces to a 'normal' scheduler.


2.5  Scheduler  Location 

The schedulers of Section 2.4 can be modified to work in a
distributed manner. Instead of one scheduler for the whole system, we
now assume one scheduler per DM (refer back to Figure 2.1). The
scheduler normally runs at the same site as the DM and schedules all
operations that the DM executes.

The new issue in this setting is that the distributed schedulers
must cooperate to attain the scheduling rules of Section 2.4.

The main problem caused by distributed schedulers is the maintenance
of global data structures. Distributed 2PL schedulers need a global
waits-for graph. Distributed SG checkers need a global SG. In
distributed T/O scheduling, no global data structures are needed; each
scheduler can make its scheduling decisions using local copies of
R-TS(x) and W-TS(x) (the largest timestamps of any accepted Read and
Write on x) for each x at its DM. Distributed certifiers generally
manifest the same problems as their corresponding schedulers.


2.5.1  Distributed  Two-Phase  Locking 

Refer to the 2PL scheduling rules of Section 2.4.1. Rules (1.) and
(2.) are 'local'. The scheduler for data item x schedules all
operations on x. Hence this scheduler can set all locks on x. Rule (3.)
requires a small amount of inter-scheduler cooperation: no scheduler can
obtain a lock for transaction T_i after any scheduler releases a lock
for T_i. This can be done by handshaking between TMs and schedulers.
When T_i Ends, its TM waits until all of T_i's Reads and Writes are
acknowledged. At this point the TM knows that all of T_i's locks are set
and that it's safe to release locks. The TM forwards End_i to the
schedulers, which then release T_i's locks.

One problem with distributed 2PL is that multi-site deadlocks are
possible. Suppose x and y are stored at sites A and B, respectively.
Suppose that r_i[x] is processed at A, setting a read-lock on x for T_i
at A; and suppose that r_j[y] is processed at site B, setting a
read-lock on y for T_j at B. If w_j[x] and w_i[y] are now issued, a
deadlock will result: T_j will be waiting for T_i to release its lock on
x at A, and T_i will be waiting for T_j to release its lock on y at B.
Unfortunately, the deadlock isn't apparent by looking at site A or site
B alone. Only when taking the union of the waits-for graphs at both
sites does the deadlock cycle materialize.

See [MENA79, STON79, GLIG80, ROSE78] for solutions to this problem.
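The two-site example can be reproduced with a short sketch in which each site's waits-for graph is a set of (waiter, holder) edges. Taking the union centrally is only one possible approach and is used here purely for illustration; the cited papers describe actual solutions. The function name is hypothetical.

```python
def has_cycle(edges):
    """edges: set of (waiter, holder) pairs. True iff the graph has a cycle."""
    succ = {}
    for a, b in edges:
        succ.setdefault(a, set()).add(b)

    def reaches(src, dst, seen):
        for n in succ.get(src, ()):
            if n == dst or (n not in seen and reaches(n, dst, seen | {n})):
                return True
        return False

    return any(reaches(t, t, {t}) for t in succ)

site_a = {("Tj", "Ti")}   # at A, T_j waits for T_i's lock on x
site_b = {("Ti", "Tj")}   # at B, T_i waits for T_j's lock on y

local = (has_cycle(site_a), has_cycle(site_b))
global_deadlock = has_cycle(site_a | site_b)
```

Neither local graph contains a cycle, so neither site can detect the deadlock alone; the cycle appears only in the union.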

2.5.2  Distributed  Timestamp  Ordering

T/O schedulers are easy to distribute because the T/O scheduling
rule of Section 2.4.2 is inherently local. Consider a basic T/O
scheduler for data item x. To process an operation on x, the scheduler
needs to know only whether a conflicting operation that has a larger
timestamp has been accepted. Since the scheduler handles all operations
on x, it can make this decision itself.


2.5.3  Distributed  Serialization  Graph  Checking 

SG checkers are harder to distribute than the other schedulers
because the serialization graph (SG) is inherently global. A
transaction that accesses data at a single site can become involved in a
cycle that spans many sites. See [CASA79] for a discussion of this
problem.


Page 2-14
Section 2

Distributed Database System Designer Handbook
A Framework For Distributed Database
Concurrency Control


2.5.4  Distributed  Certifiers 

Distributed certifiers have a synchronization requirement a bit
like Rule (3.) of 2PL: T_i's TM must not send End_i to any certifier
until all of T_i's Reads and Writes have been acknowledged (i.e., we
must not try to certify T_i at any site until we are ready to certify
T_i at all sites).

Beyond this, each distributed certifier behaves like the
corresponding scheduler. A distributed 2PL certifier needs little
inter-scheduler cooperation (beyond that described in the previous
paragraph). The certifier at each site keeps track of the data items at
its site read or written by active transactions. When a certifier at
site A receives End_i, it sees whether any active transaction conflicts
with T_i at site A. If not, T_i is certified at site A. If T_i is
certified at all sites at which it accessed data, then it is 'really'
certified; otherwise it is aborted.

A distributed SG certifier shares the problems of distributed SG
schedulers. The certifier needs to check for cycles in a global graph
every time a transaction ends.


2.5.5  Other  Architectures 

Centralized and distributed scheduling are endpoints of a spectrum.
One can imagine hybrid architectures that feature multiple DMs per
scheduler. See Figure 2.5. This architecture adds no technical issues
beyond those already discussed.

Hierarchical scheduler architectures are also possible. See Figure
2.6. To our knowledge, no one has studied this approach.


2.6  Data  Replication 

In a replicated database, each logical data item, x, can have many
physical copies denoted {x_1,...,x_m}, which are resident at different
DMs. Transactions issue Reads and Writes on logical data items. TMs
translate those operations into Reads and Writes on physical data. The
effect, as seen by each transaction, must be as if there were only one
copy of each data item.

Figure 2.5 Hybrid Architecture

Figure 2.6 Hierarchical Architecture


There is a simple way to obtain this effect. Each TM translates
r_i[x] into r_i[x_j] for some copy x_j of x and w_i[x] into {w_i[x_j] |
all copies x_j of x}. If the scheduler(s) is serializable (SR), the
effect is just like a nonreplicated database. To see this, consider a
serial log equivalent to the SR log that executed. Since each
transaction writes into all copies of each logical data item, each
r_i[x_j] reads from the 'latest' transaction preceding it that wrote
into any copy of x. But this is exactly what would have happened had
there been only one copy of x. (For a more rigorous explanation, see
[ATTA82].) Consider this example:


[Figure: the example log, drawn as a partial order over these
operations]
     T0: w0[x1], w0[x2], w0[y1], w0[y2]
     T1: r1[x1], r1[y1], followed by w1[x1], w1[x2]
     T2: r2[x2], r2[y2], followed by w2[y1], w2[y2]


x1 and x2 are copies of logical data item x; y1 and y2 are copies of y.
T0 produces initial values for both copies of each data item. T1 reads
x and y, and writes x; T2 reads x and y, and writes y.


This log is SR. It is equivalent to the following serial log:

L4 = w0[x1] w0[x2] w0[y1] w0[y2] r1[x1] r1[y1] w1[x1] w1[x2]
     r2[x2] r2[y2] w2[y1] w2[y2].

Note that each Read, e.g., r2[x2] or r2[y2], reads from the 'latest'
transaction preceding it that wrote into any copy of the data item.
Therefore, L4 has the same effect as the following log in which there
is no replicated data:

     w0[x] w0[y] r1[x] r1[y] w1[x] r2[x] r2[y] w2[y].

We call this the do nothing approach to replication: just write
into all copies of each data item and use an SR scheduler.
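
The translation performed by a TM in the do nothing approach can be
sketched in a few lines. This is only an illustration: the function name
translate, the tuple encoding of operations, and the read_choice
parameter (which copy a Read is sent to) are assumptions of the sketch,
not part of any system described in this handbook.

```python
def translate(op, txn, item, copies, read_choice=0):
    """Translate one logical operation into physical operations.

    op is "r" or "w"; copies maps each logical item to its physical copies.
    A Read goes to a single copy (hypothetically chosen by read_choice);
    a Write goes to every copy of the item.
    """
    names = copies[item]
    if op == "r":
        return [("r", txn, names[read_choice])]
    if op == "w":
        return [("w", txn, name) for name in names]
    raise ValueError(f"unknown operation: {op!r}")


copies = {"x": ["x1", "x2"], "y": ["y1", "y2"]}
t1_writes_x = translate("w", 1, "x", copies)                # w1[x1], w1[x2]
t2_reads_x = translate("r", 2, "x", copies, read_choice=1)  # r2[x2]
```

Because every Write reaches all copies, a Read may be sent to any copy
without changing what the transaction sees under an SR scheduler.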

Two other approaches to replication have been suggested. In the
primary copy approach, some copy of each x, say x_p, is designated as
its primary copy [STON79]. Each TM translates r_i[x] into r_i[x_j] for
some copy x_j, as before. Writes are translated differently. The TM
translates w_i[x] into a single Write, w_i[x_p], on the primary copy.
When the primary copy's scheduler outputs w_i[x_p], it also issues
Writes on the other copies of x (i.e., w_i[x_1],...,w_i[x_m]). See
Figure 2.7. These Writes are processed by the schedulers for
x_1,...,x_m in the usual way. For example, in 2PL, the scheduler for
x_j must set a write-lock on x_j for T_i before outputting w_i[x_j].
The primary copy's scheduler may be centralized (in which case the
technique is called primary site [ALSB76]), or distributed with the
primary copy's DM.

Figure 2.7 Processing Writes in Primary Copy

Primary copy is a good idea for 2PL schedulers. It eliminates the
possibility of deadlock caused by Writes on different copies of one data
item. Suppose x has copies x_1 and x_2. Suppose that T1 and T2 want to
Write x at about the same time. In the do nothing approach, the
following execution is possible: T1 locks x_1; T2 locks x_2; T1 tries to
lock x_2 but is blocked by T2's lock; T2 tries to lock x_1 but is
blocked by T1's lock. This is a deadlock. Primary copy avoids this
possibility because each transaction must lock the primary copy first.

In the voting approach to replication, TMs again distribute Writes
to all copies of each data item [THOM79]. Assume that we are using
distributed schedulers. When a scheduler is ready to output w_i[x_j],
it sends a vote of yes to the vote collector for x; it does not output
w_i[x_j] at this time. When the vote collector receives yes votes from
a majority of schedulers, it tells all schedulers to output their
Writes. (Each scheduler may need to update its local data structures
before outputting w_i[x_j], e.g., set a write-lock on x_j.) Assume each
scheduler is correct (i.e., produces an acyclic SG). Then, since every
pair of conflicting operations was voted yes by some correct scheduler
(both operations got a majority of yes's), the SG must be acyclic and
the result is correct.

The principal benefit of voting is fault tolerance; it works
correctly as long as a majority of sites holding a copy of x are
running. See [THOM79, GIFF79] for details.
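
The majority test applied by a vote collector can be sketched as
follows. The class VoteCollector and its method vote_yes are
hypothetical names chosen for the sketch; message passing, failures,
and the schedulers themselves are left out.

```python
class VoteCollector:
    """Counts yes votes for each Write, one collector per data item."""

    def __init__(self, schedulers):
        self.schedulers = set(schedulers)   # schedulers holding a copy
        self.yes_votes = {}                 # txn -> schedulers that voted yes

    def vote_yes(self, scheduler, txn):
        """Record a yes vote; return True once a majority has voted yes."""
        if scheduler not in self.schedulers:
            raise ValueError(f"unknown scheduler: {scheduler!r}")
        votes = self.yes_votes.setdefault(txn, set())
        votes.add(scheduler)
        return 2 * len(votes) > len(self.schedulers)


collector = VoteCollector(["s1", "s2", "s3"])
after_one_vote = collector.vote_yes("s1", txn=1)    # one of three: no majority
after_two_votes = collector.vote_yes("s2", txn=1)   # two of three: majority
```

When vote_yes returns True, the collector would tell all schedulers to
output their Writes. Any two majorities share at least one scheduler,
which is why every pair of conflicting operations is seen by some
common scheduler.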


2.7  Multiversion  Data 

Let  us  return  to  a  database  system  model  where  each  logical  data 
item  is  stored  at  one  DM. 

In a multiversion database, each Write, w_i[x], produces a new copy
(or version) of x, denoted x^i. Thus, the value of x is a set of
versions. For each Read, r_i[x], the scheduler selects one of the
versions of x to be read. Since Writes don't overwrite each other, and
since Reads can read any version, the scheduler has more flexibility in
controlling the effective order of Reads and Writes.

Although the database has multiple versions, users expect their
transactions to behave as if there were just one copy of each data item.
Serial logs don't always behave this way. For example:

     w0[x^0] r1[x^0] w1[x^1 y^1] r2[x^0 y^1] w2[y^2]

is a serial log, but its behavior cannot be reproduced with only one
copy of x. We must therefore restrict the set of allowable serial logs.

A serial log is 1-copy serial (or 1-serial) if each r_i[x^j] reads
from the last transaction preceding it that wrote into any version of x.
The above log is not 1-serial, because r2 reads x from w0, but
w0[x^0] < w1[x^1] < r2[x^0]. A log is 1-serializable (1-SR) if it's
equivalent to a 1-serial log. 1-serializability is our correctness
criterion for multiversion database systems.

All multiversion concurrency control algorithms (that we know of)
totally order the versions of each data item in some simple way. A
version order, <<, for L is an order relation over versions such that,
for each x, << totally orders the versions of x.

Given a version order <<, define the multiversion SG w.r.t. L and
<< (denoted MVSG(L,<<)) as SG(L) with the following edges added:* for
each r_i[x^j] and w_k[x^k] in L, if x^k << x^j then include T_k -> T_j,
else include T_i -> T_k.

MULTIVERSION THEOREM [BERN81a]. A multiversion log is 1-SR iff
there exists a version order << such that MVSG(L,<<) is acyclic.

This theorem enables us to prove multiversion concurrency control
algorithms to be correct. We must argue that for every log L produced
by the algorithm, MVSG(L,<<) is acyclic for some <<.
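
The construction of MVSG(L,<<) and the acyclicity test can be sketched
as follows. The encoding is an assumption of the sketch: a version is
named by the transaction that wrote it, a log is a list of
(kind, txn, item, version) tuples, and the added edges are generated
only when the reader, the writer of the version read, and the other
writer are three distinct transactions. The function names are
hypothetical.

```python
def mvsg_edges(log, version_order):
    """Return the edges of MVSG(L, <<) as (from_txn, to_txn) pairs.

    version_order maps each item to its versions, earliest first (<<).
    """
    reads = [(t, x, v) for kind, t, x, v in log if kind == "r"]
    writes = [(t, x) for kind, t, x, v in log if kind == "w"]
    edges = set()
    for i, x, j in reads:                 # r_i[x^j]
        if i != j:
            edges.add((j, i))             # SG(L) edge: T_i read from T_j
        rank = version_order[x].index
        for k, y in writes:               # w_k[x^k]
            if y != x or k in (i, j):     # assumption: i, j, k distinct
                continue
            if rank(k) < rank(j):         # x^k << x^j
                edges.add((k, j))
            else:
                edges.add((i, k))
    return edges


def is_acyclic(edges):
    """Depth-first search for a cycle in a directed graph."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set())
    state = {}                            # node -> "active" or "done"

    def visit(node):
        state[node] = "active"
        for nxt in graph[node]:
            if state.get(nxt) == "active":
                return False              # back edge: a cycle exists
            if nxt not in state and not visit(nxt):
                return False
        state[node] = "done"
        return True

    return all(visit(n) for n in graph if n not in state)


# T2 reads the old version x^0 although T1 has written x^1.
bad_log = [("w", 0, "x", 0), ("r", 1, "x", 0), ("w", 1, "x", 1),
           ("w", 1, "y", 1), ("r", 2, "x", 0), ("r", 2, "y", 1),
           ("w", 2, "y", 2)]
# The same log, except that T2 reads x^1.
good_log = [op if op != ("r", 2, "x", 0) else ("r", 2, "x", 1)
            for op in bad_log]
order = {"x": [0, 1], "y": [1, 2]}
```

With these inputs, is_acyclic(mvsg_edges(good_log, order)) holds, while
bad_log yields a cycle under both possible orders of the versions of x.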

The types of multiversion schedulers that have been proposed fall
into two classes that approximately correspond to timestamping and
locking.


*Note that the two operations conflict (and produce an edge in SG(L))
if they operate on the same version and one of them is a write.


2.7.1 Multiversion Timestamping

Multiversion concurrency control was first introduced by Reed in
his multiversion timestamping method [REED78]. In Reed's algorithm,
each transaction has a unique timestamp. Each Read and Write carries
the timestamp of the transaction that issued it, and each version
carries the timestamp of the transaction that wrote it. The version
order is defined by x^i << x^j if TS(i) < TS(j).

Operations are processed 'first come, first served'.** However, the
version selection rules ensure that the overall effect is as if
operations were processed in timestamp order. To process r_i[x], the
scheduler (or DM) returns the version of x with the largest timestamp
<= TS(i). To process w_i[x], version x^i is created, unless some w_j[x]
and r_k[x] have already been processed with TS(j) < TS(i) < TS(k). If
this condition holds, the Write is rejected.

An analysis of MVSG(L,<<) for any L produced by this method shows
that every edge T_i -> T_j is in timestamp order (TS(i) < TS(j)). Thus
MVSG(L,<<) is acyclic, and so L is 1-SR.
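
The version selection and Write rejection rules can be sketched for a
single data item. The sketch is not Reed's implementation. It
interprets the rejection condition as: some r_k has already read a
version x^j with TS(j) < TS(i) < TS(k). The class name MVItem and the
use of timestamp 0 for the initial version are assumptions.

```python
class MVItem:
    """One data item under multiversion timestamping (illustrative)."""

    def __init__(self, initial_value):
        self.versions = {0: initial_value}  # write timestamp -> value
        self.reads = []                     # (reader ts, ts of version read)

    def read(self, ts):
        """Return the version with the largest timestamp <= ts."""
        chosen = max(t for t in self.versions if t <= ts)
        self.reads.append((ts, chosen))
        return self.versions[chosen]

    def write(self, ts, value):
        """Create a version, or reject the Write (return False)."""
        for reader_ts, version_ts in self.reads:
            if version_ts < ts < reader_ts:
                return False                # would invalidate an earlier Read
        self.versions[ts] = value
        return True


item = MVItem("v0")
first_read = item.read(5)           # reads the initial version
late_write = item.write(3, "v3")    # rejected: 0 < 3 < 5
new_write = item.write(7, "v7")     # accepted
old_read = item.read(6)             # still sees the initial version
new_read = item.read(8)             # sees the version written at 7
```

Note that the Read with timestamp 6 is not delayed or rejected even
though a newer version exists; it is simply directed to an older one.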

2.7.2  Multiversion  Locking 

In multiversion locking, the Writes on each data item, x, must be
ordered. We define x^i << x^j if w_i[x^i] < w_j[x^j]. Each version is
in the certified or uncertified state. When a version is first written,
it is uncertified. Each Read, r_i[x], reads either the last (wrt <<)
certified version of x or any uncertified version of x. When a
transaction finishes executing, the database system attempts to certify
it. To certify T_i, three conditions must hold:

C1. For each r_i[x^j], x^j is certified.

C2. For each w_i[x^i], all x^j << x^i are certified.

C3. For each w_i[x^i] and each x^j << x^i, all transactions that read
x^j have been certified.

These conditions must be tested atomically. When they hold, T_i is
declared to be certified and all versions it wrote are (atomically)
certified.


**Handshaking is used to ensure that logically conflicting operations
are executed by DMs in the order the scheduler output them.

An analysis of MVSG(L,<<) for any L produced by this method shows
that every edge T_i -> T_j is consistent with the order in which
transactions were certified. Since certification is an atomic event,
the certification order is a total order. Thus, MVSG(L,<<) is acyclic,
and so L is 1-SR.

Two details of the algorithm require some discussion. First, the
algorithm can deadlock. For example, in this log:

     w0[x^0] r1[x^0] r2[x^0] w1[x^1] w2[x^2]

T1 and T2 are deadlocked due to certification condition C3. As in 2PL,
deadlocks can be detected by cycle detection on a waits-for graph whose
edges include T_i -> T_j such that T_i is waiting for T_j to become
certified (so that T_i will satisfy C1-C3).
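
A certification test in the spirit of C1-C3 can be sketched as follows.
The encoding is hypothetical: versions are named by their writers, a
version counts as certified when its writer is certified, and the
transaction being certified is exempted from waiting on its own Reads
and its own versions (an assumption; without it a transaction that
reads an item and then writes it could never be certified).

```python
def can_certify(t, reads, versions, certified):
    """Test C1-C3 for transaction t.

    reads: set of (reader, item, writer of the version read)
    versions: item -> writers of its versions, in << order
    certified: set of certified transactions
    """
    for reader, item, writer in reads:
        if reader == t and writer != t and writer not in certified:
            return False                                   # C1 fails
    for item, order in versions.items():
        if t not in order:
            continue
        earlier = order[:order.index(t)]                   # all x^j << x^t
        if any(j not in certified for j in earlier):
            return False                                   # C2 fails
        for reader, read_item, writer in reads:
            if (read_item == item and writer in earlier
                    and reader != t and reader not in certified):
                return False                               # C3 fails
    return True


# The deadlocked log: w0[x^0] r1[x^0] r2[x^0] w1[x^1] w2[x^2]
versions = {"x": [0, 1, 2]}
reads = {(1, "x", 0), (2, "x", 0)}
t1_ok = can_certify(1, reads, versions, certified={0})
t2_ok = can_certify(2, reads, versions, certified={0})
```

Neither transaction can be certified, and neither can become
certifiable unless the other is certified or aborted first.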

Second, C1-C3 can be tested atomically without using a critical
section. Once C1 or C2 is satisfied for some r_i[x^j] or w_i[x^i], no
future event can falsify it. When C3 becomes true for some w_i[x^i], we
'lock' x^i so that no future reads can read versions that precede x^i.
This allows C1-C3 to be checked one data item at a time. Of course, the
waits-for graph must be extended to account for these new version locks.

Two similar multiversion locking algorithms have been proposed
which allow at most one certified version of each data item. In
Stearns and Rosenkrantz's method [STEA81], the waits-for graph is
avoided by using a timestamp-based deadlock avoidance scheme. In Bayer
et al.'s method [BAYE80a, BAYE80b], a waits-for graph is used to prevent
deadlocks. This algorithm consults the waits-for graph before selecting
a version to read, and it always selects a version that creates no
cycles.

Multiversion locking algorithms in which queries (read-only
transactions) are given special treatment are described in [CHAN82,
DUBO82, BERN82].


2.8 Combining the Techniques

The techniques described in Sections 2.4-2.8 can be combined in
almost all possible ways. The three basic scheduling techniques (2PL,
T/O, SG checking) can be used in scheduler mode or certifier mode. This
gives six basic concurrency control techniques. Each technique can be
used for rw or ww scheduling or both (6^2 = 36). Schedulers can be
centralized or distributed (36 x 2 = 72), and replicated data can be
handled in three ways (Do Nothing, Primary Copy, Voting) (72 x 3 = 216).
Then, one can use multiversions or not (216 x 2 = 432). By considering
the multifarious variations of each technique, the number of distinct
algorithms is in the thousands.
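
The count of 432 can be reproduced by enumerating the choices
explicitly. The list names below are labels chosen for the sketch.

```python
from itertools import product

techniques = ["2PL", "T/O", "SG checking"]
modes = ["scheduler", "certifier"]
basic = list(product(techniques, modes))        # 3 x 2 = 6 basic techniques

placements = ["centralized", "distributed"]
replication = ["do nothing", "primary copy", "voting"]
multiversion = [False, True]

# One basic technique for rw scheduling and one for ww scheduling,
# combined with every placement, replication, and multiversion choice.
combinations = list(product(basic, basic, placements,
                            replication, multiversion))
total = len(combinations)                       # 6 * 6 * 2 * 3 * 2
```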

To  illustrate  our  framework,  we  describe  some  of  the  algorithms 
that  already  have  appeared  in  the  literature. 

The  distributed  locking  algorithm  proposed  for  System  R*  uses  a  2PL 
scheduler  for  rw  and  ww  synchronization.  The  schedulers  are  distributed 
at  the  DMs.  Replication  is  handled  by  the  do  nothing  approach. 

Distributed INGRES uses a similar locking algorithm [STON79]. The
main difference is that distributed INGRES uses primary copy for
replication.

SDD-1 uses conservative T/O for rw scheduling and Thomas' write
rule for ww scheduling. The algorithm has distributed schedulers and
takes the do nothing approach to replication [BERN80b]. SDD-1 also uses
conflict graph analysis, a technique for preanalyzing transactions to
determine which run-time conflicts need not be synchronized.

A method using 2PL for rw scheduling and Thomas' write rule for ww
scheduling is described in [BERN81b]. Distributed schedulers and the do
nothing approach to replication were suggested. To ensure that the
locking order is consistent with the timestamp order, one can use a
Lamport clock: Each message is timestamped with the local clock time
when it was sent; if a site receives a message with a timestamp, TS,
greater than its local clock time, the site pushes its clock ahead to
TS. After a transaction obtains all of its locks, it is assigned a
timestamp using the TM's local Lamport clock.

Thomas' majority consensus algorithm was one of the first
distributed concurrency control algorithms. It uses a 2PL certifier for
rw scheduling and Thomas' write rule for ww scheduling. Schedulers are
distributed and voting is used for replication. Each transaction is
assigned a timestamp from a Lamport clock when it is certified. This
ensures that the certification order (produced by rw scheduling) is
consistent with the timestamp order used for ww scheduling.
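
The Lamport clock rule used by these two methods can be sketched as
follows. The class and method names are hypothetical, and advancing
the clock by one on every send is an assumption of the sketch; the
text above specifies only the rule applied on receipt.

```python
class LamportClock:
    """A per-site logical clock (illustrative sketch)."""

    def __init__(self, time=0):
        self.time = time

    def send(self):
        """Tick, then return the timestamp to attach to an outgoing message."""
        self.time += 1
        return self.time

    def receive(self, ts):
        """Push the local clock ahead to ts if ts is greater."""
        if ts > self.time:
            self.time = ts
        return self.time


site_a = LamportClock(time=10)
site_b = LamportClock(time=3)
ts = site_a.send()                  # message leaves site A
b_after = site_b.receive(ts)        # site B is pushed ahead to ts
b_next = site_b.send()              # later than anything B has seen
```

A timestamp drawn from site B after the receipt is larger than the
timestamp of the message it received, which is the property needed to
keep the locking (or certification) order consistent with the
timestamp order.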

Each of these algorithms is quite complex. A complete treatment of
each would be lengthy. Yet, by understanding the basic techniques and
how they can be correctly combined, we can explain the essentials of
each algorithm in a few sentences.

Performance of these algorithms has been studied in [LIN81,
LINN82a, LINN82b, LINN82c, LINN83, GARC78, GARC79, GELE78]. A
comprehensive comparison of these algorithms can be found in Section 4.




Distributed  Database  System  Designer  Handbook 
Database  Recovery  Algorithms 


Page  3-1 
Section  3 


3.  Database  Recovery  Algorithms 


3.1  Introduction 

A database system (DBS) processes read and write commands issued by
users' transactions to access the database. If a transaction fails in
midstream, or if the system fails, the database may be left in an
incorrect state. For example, if a money transfer transaction fails
after posting its debit but before posting its corresponding credit,
then the accounts are left unbalanced. The recovery algorithm of a DBS
avoids these incorrect states by ensuring that the database only
includes updates that are produced by transactions that execute to
completion. This section is a survey of recovery algorithms for
centralized and distributed DBSs.

Computer systems can fail in many ways, only some of which are
handled by DBS recovery algorithms. We limit our attention to clean
failures in which a transaction, the system, or, in the case of a
distributed DBS, one site of the system, simply stops running. We do not
consider traitorous failures in which components continue to run but
perform incorrect actions (see [DOLE82, PEAS80]). We further limit
attention to soft failures in which the contents of main memory are
lost, but the contents of secondary memory (disk) remain intact. We do
not consider methods for recovering from disk failures, although methods
similar to those in this section apply (see [GRAY81, GRAT81, HABD79,
HAKD82, LIND79, LORI77, VERH78]).

We describe a model of centralized DBS recovery in Section 3.2. We
present four canonical types of centralized DBS recovery algorithms in
Sections 3.3 through 3.6. We describe recovery algorithms for
distributed DBSs in Section 3.7.


3.2 A Model of Centralized Database System Recovery

We model a centralized database system as a scheduler, a recovery
system, and storage.


Storage

The storage component consists of buffer storage and stable
storage. Both are divided into physical pages of equal and fixed size.
Buffer storage models main memory. Buffer storage is relatively fast,
but of limited capacity, and it doesn't survive system crashes. Stable
storage models disk memory; it is relatively slow, of (almost)
unlimited capacity, and it does survive crashes.

The database consists of a set of logical pages. We assume that
one physical copy (usually the most up-to-date copy) of each logical
page is stored in a portion of stable storage called the stable
database. Other portions of stable storage may be used by the recovery
system as nonvolatile scratch space in ways that will be described
later.

Transactions 

A transaction is a program that can read from or write into the
database. A transaction can issue four types of commands: Read, Write,
Commit, and Abort. Read causes a page to be read from the database.
Write causes a new copy of a logical page to be written into the
database. Commit tells the system that the transaction has terminated
and that all of its updated pages should be permanently reflected in the
database. Abort tells the system that the transaction has terminated
abnormally and that the pages it wrote into should be returned to their
previous state. (Commit and Abort may be issued by a process
controlling the transaction, rather than by the transaction itself.) A
transaction can have only one Commit or Abort processed.

A transaction is active if it has begun executing but has not yet
had its Commit or Abort processed.

Notation: Each command is subscripted by the transaction that
issued it. For example, Read_i(P_j) is a Read issued by transaction T_i
on page j.

The Scheduler

The scheduler controls the order in which Reads, Writes, Commits,
and Aborts are passed to the recovery system. Although the scheduler
allows commands from different transactions to be interleaved, it
guarantees that the resulting execution is serializable. An execution
is serializable if the effect is exactly the same as if the transactions
had been executed serially, one after the next, with no concurrency at
all. Many scheduling algorithms for attaining serializability are
discussed in Section 2. Versions of all of them are compatible with the
recovery algorithms described in this section.

The scheduler also guarantees that the execution is recoverable.
An execution is recoverable if, for each transaction T_i, T_i is not
committed until, for each page read by T_i, the transaction that last
wrote that page is committed.
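
The definition can be checked mechanically on a recorded execution.
The sketch below is a simplification under stated assumptions: an
execution is a list of tuples ("r", txn, page), ("w", txn, page),
("c", txn), or ("a", txn); Aborts are not modeled beyond being
skipped; and the function name is hypothetical.

```python
def is_recoverable(execution):
    """Return True if no transaction commits before the writers it read from."""
    last_writer = {}      # page -> transaction that last wrote it
    read_from = {}        # txn -> transactions whose writes it has read
    committed = set()
    for op in execution:
        kind, txn = op[0], op[1]
        if kind == "w":
            last_writer[op[2]] = txn
        elif kind == "r":
            writer = last_writer.get(op[2])
            if writer is not None and writer != txn:
                read_from.setdefault(txn, set()).add(writer)
        elif kind == "c":
            if not read_from.get(txn, set()) <= committed:
                return False          # txn read from an uncommitted writer
            committed.add(txn)
    return True


# T1 reads a page written by the still-active T2, then commits first.
unsafe = [("w", "T2", "P1"), ("r", "T1", "P1"), ("w", "T1", "P2"),
          ("c", "T1"), ("a", "T2")]
# The same operations, but T1 commits only after T2 has committed.
safe = [("w", "T2", "P1"), ("r", "T1", "P1"), ("w", "T1", "P2"),
        ("c", "T2"), ("c", "T1")]
```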

Recoverability is needed to avoid errors such as the following.
Suppose T_i reads a page P_k last written by T_j (which is still
active), T_i writes another page P_l, and commits. Now, suppose T_j
fails and is aborted. Aborting T_j causes its write on P_k to be undone,
thereby rendering T_i's input invalid. But, since T_i cannot be aborted
after having been committed, T_i's updates to P_l must remain in the
database even though its input P_k is invalid.

For definiteness, we assume that the scheduler uses page-level
two-phase locking (2PL) [EAGE81]. Before outputting Read_i(P_j) (resp.
Write_i(P_j)), the scheduler sets a read lock (resp. write lock) on page
P_j for transaction T_i. Two transactions cannot concurrently own
conflicting locks on the same page, where read locks conflict with write
locks and write locks conflict with read and write locks. If the
scheduler receives an operation for which it can't set the corresponding
lock, it delays the operation until the lock can be set.

When the scheduler receives a Commit_i or an Abort_i, it forwards
the operation directly to the recovery system. When the recovery system
acknowledges that the operation has been processed, the scheduler
releases all the locks held by T_i.

Two-phase locking ensures serializability (see [BERN82, ESWA76] for
proofs). The version of 2PL presented above also ensures recoverability
by requiring that a transaction hold its write locks until its Commit or
Abort is processed.

The Recovery System

The recovery system processes the Read, Write, Commit, and Abort
commands it receives from the scheduler. It also handles system
failures.

A system failure can interrupt the DBS at any moment. It causes
all processing to stop and the contents of buffer storage to be lost.
After the system recovers, transactions that were active at the time of
the failure cannot continue executing because the contents of main
memory are now useless. Thus, after the failure and before processing
any other commands, the recovery system processes the Restart command,
whose effect is to abort all active transactions.

To handle failures properly, it is essential that the Commit
command be implemented in a single instruction, normally a page write.
If it were to require more than one instruction, a system failure could
interrupt a partially completed Commit, making it ambiguous whether the
transaction should be aborted during restart. Said differently, each
transaction must always be in one of three states: active, committed, or
aborted, and each state change must be implemented by an atomic
instruction execution.

Structuring  Scratch  Space 

There are several types of information that a recovery algorithm
stores in stable scratch space. It may store the identifiers of
transactions that have committed, called the commit list. In this case,
the single instruction that implements Commit_i is usually a write that
adds T_i to the commit list. The recovery algorithm also may store a
list of identifiers of transactions that are active, called the active
list, and those that have aborted, called the abort list.

Recovery algorithms often store copies of pages that were recently
written on an audit trail (sometimes called a journal or log). For each
write processed by the recovery algorithm, the audit trail may contain
the identifier of the transaction that performed the write, a copy of
the newly written page (called an after-image), and a copy of the
physical page in the stable database that was overwritten by the write
(called a before-image). Different algorithms vary considerably in the
information they keep on the audit trail and in how they structure that
information.

Undo and Redo

Recovery algorithms also differ in the time at which they write
pages into the stable database. They may perform such writes before,
concurrently with, or after the atomic instruction that commits the
transaction that last wrote those pages.

Suppose that a page written by an active transaction is written
into the stable database before the transaction commits. If the
transaction aborts due to a system or transaction failure, the recovery
algorithm must undo the write by restoring the previous copy
(before-image) of the page.

Suppose that a page written by an active transaction is not written
into the stable database before the transaction commits. If a system
failure occurs after the transaction commits but before the page is
written into the stable database, the recovery algorithm must redo the
write by moving the page to the stable database.

In every recovery algorithm, the after-images produced by a
transaction must be written to stable storage (the database or scratch
space) before the transaction commits. This is called the commit rule.
If it is violated, a system failure shortly after a transaction T_i
commits could leave the recovery algorithm with no stable copy of T_i's
after-images, making it impossible to redo T_i.

Every recovery algorithm must also obey the log ahead rule: if an
after-image is written to the stable database before the transaction
that wrote it commits, then the before-image of that page must first be
written to the audit trail. Otherwise, a system failure could occur
after the after-image is in the stable database but before the
before-image is in the audit trail, in which case the write could not be
undone.

Categorization of Recovery Algorithms

Recovery algorithms can be categorized based on the timing of
updates to the stable database. There are four types of recovery
algorithms: those that may require undo but not redo, redo but not undo,
both undo and redo, and neither undo nor redo. These types of algorithms
are described in Sections 3.3-3.6.


3.3 Algorithms That Undo But Don't Redo

For each type of recovery algorithm, we present a generic algorithm
based on our database system model, and then we list example
implementations. We describe this generic version by explaining how each
command is processed. In all of the algorithms, the first command
processed for T_i should add T_i to the active list.

For each operation, we mark by '{Ack}' the point at which the
recovery system can acknowledge to the scheduler that the operation has
been completed. Sometimes the operation has additional work to do after
the acknowledgement is sent.

Read_i(P_j). Copy P_j from the stable database into a buffer. {Ack}

Write_i(P_j). Copy the before-image of P_j (from the stable database)
to the audit trail. {Ack} Then* (after the disk acknowledges the write
in the audit trail), write the new copy of P_j into the stable database.

Commit_i. Make sure all pages written by T_i are in the stable
database. Then write T_i into the commit list. {Ack} Then delete it
from the active list.

Abort_i. Write T_i into the abort list. Then undo all of T_i's
writes by reading their before-images from the audit trail and writing
them back into the stable database. {Ack} Then, delete T_i from the
active list.

Restart. Process Abort_i for each T_i on the active list. {Ack}

In this algorithm, all pages written by a transaction are written
into the stable database before the transaction commits. Thus, redo is
never needed, but an abort may require undo.
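
The generic undo/no-redo algorithm can be sketched as a small
in-memory model. It is an illustration only: the class and attribute
names are hypothetical, every attribute stands for stable storage,
buffers and acknowledgements are not modeled, and each write reaches
the stable database immediately.

```python
class UndoNoRedo:
    """Illustrative model of an undo/no-redo recovery algorithm."""

    def __init__(self, database):
        self.stable_db = dict(database)   # page -> current value
        self.audit = []                   # (txn, page, before-image)
        self.active = set()
        self.commits = set()
        self.aborts = set()

    def read(self, t, page):
        self.active.add(t)
        return self.stable_db[page]

    def write(self, t, page, value):
        self.active.add(t)
        # Log ahead rule: save the before-image before overwriting the page.
        self.audit.append((t, page, self.stable_db[page]))
        self.stable_db[page] = value

    def commit(self, t):
        # All pages written by t are already in the stable database.
        self.commits.add(t)               # the single step that commits t
        self.active.discard(t)

    def abort(self, t):
        self.aborts.add(t)
        for txn, page, before in reversed(self.audit):
            if txn == t:
                self.stable_db[page] = before   # undo, newest write first
        self.active.discard(t)

    def restart(self):
        for t in list(self.active):
            self.abort(t)


db = UndoNoRedo({"P1": 100, "P2": 50})
db.write(1, "P1", 90)
db.write(1, "P2", 60)
db.commit(1)
db.write(2, "P1", 0)      # T2 is still active when the system fails
db.restart()              # T2's write is undone; T1's writes survive
```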

It is actually not necessary to write an after-image into the
stable database immediately after the before-image is written into the
audit trail. The after-image could be left in buffer storage for
awhile, provided it is written to the stable database before the
transaction commits as required by the commit rule.

This algorithm obeys the log ahead rule in processing Write_i(P_j);
the before-image of P_j is written to the audit trail before the
after-image is written to the stable database.

The order in which writes are applied to stable storage is quite
sensitive in this (and most other) recovery algorithms. In this
algorithm, for example, in processing Commit_i it is incorrect to delete
T_i from the active list before writing it into the commit list.

Remember that a system failure can occur during the processing of a
Restart. So Restart must also take care to reload the current active
list into stable storage in order that it will be resilient to a system
failure (followed by another Restart).


* In every algorithm, we use 'then' to mean 'wait for the previous step
to complete before proceeding to the next step'.

After Commit_i or Abort_i has been processed, the audit trail
copies of pages written by T_i are no longer needed and can be returned
to free space. The algorithm for garbage collecting these audit trail
pages depends principally on the audit trail's data structure. We will
not discuss garbage collection issues for any of the recovery methods
described in this section.

The  Prime  Algorithm 

This type of recovery algorithm is used in a database system
product offered by Prime Computers [DUBO82], and in the DDM database
system being developed at CCA [RIES82].

In  Prime's  algorithm,  each  page  in  the  stable  database  has  a 
pointer  to  its  before-image  in  the  audit  trail.  Each  before-image  in 
the  audit  trail  points,  in  turn,  to  the  next  older  before-image  of  the 
same  page.  Also,  each  physical  page  carries  the  transaction  identifier 
of  the  transaction  that  wrote  that  particular  copy.  And,  for  each 
active  transaction  there  is  a  convenient  way  to  obtain  a  list  of  all 
pages  it  has  written. 

The  page  pointers  are  used  for  two  purposes.  First,  to  process  an 
Abort,  the  pointer  in  each  stable  database  page  makes  it  easy  to  undo 
the  aborted  transaction's  writes.  Second,  they  help  avoid  concurrency 
control  conflicts  between  queries  and  updates,  as  follows. 

A query is a read-only transaction. Reads issued by queries are
not locked in the scheduler but are passed directly to the recovery
system (without being delayed). When the recovery algorithm receives the
first read issued by a query T_i, say Read_i(P_j), it reads the commit
list and then selects the newest copy on the chained list of P_j copies
whose transaction identifier is on the commit list. Subsequent reads by
T_i are processed in the same way, using the copy of the commit list that
was read when the first Read_i was processed. By reading in this way,
queries see a consistent copy of the database, yet they do not set read
locks that might delay update transactions.
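
The query read rule above can be illustrated with a small Python sketch.
It is only an illustration under assumptions: the names PageVersion and
Query are hypothetical, the chain of before-images is modeled as a linked
list held in memory, and the commit list is modeled as a set of
transaction identifiers.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PageVersion:
    """One physical copy of a page; `older` is the next older before-image."""
    writer: str                       # transaction that wrote this copy
    value: str
    older: Optional["PageVersion"] = None


class Query:
    """A read-only transaction that freezes the commit list at its first read."""

    def __init__(self):
        self.snapshot = None

    def read(self, newest: PageVersion, commit_list: set) -> str:
        if self.snapshot is None:
            # First read: take a private copy of the commit list.
            self.snapshot = frozenset(commit_list)
        version = newest
        while version is not None:
            if version.writer in self.snapshot:
                return version.value  # newest copy written by a committed txn
            version = version.older
        raise LookupError("no committed copy of this page")


# T1 committed "old"; T2 has written "new" but has not committed yet.
v1 = PageVersion("T1", "old")
v2 = PageVersion("T2", "new", older=v1)
commit_list = {"T1"}

q = Query()
first = q.read(v2, commit_list)    # skips T2's uncommitted copy
commit_list.add("T2")              # T2 commits while the query is running
second = q.read(v2, commit_list)   # still answered from the frozen list
```

Because the query keeps using the commit list it read first, both reads
return the same version even though T2 commits in between; a query that
starts afterwards sees T2's copy.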

Another undo/no-redo algorithm is described in [RAPP75].


3.4 Algorithms That Redo But Don't Undo

In the generic algorithm, each command is processed as follows.

Read_i(P_j). If T_i previously wrote P_j, then copy the after-image of
P_j into a buffer. Otherwise, copy P_j from the stable database into a
buffer. {Ack}

Write_i(P_j). Write the new value of P_j into the audit trail. {Ack}

Commit_i. Write T_i into the commit list. Then for each page written
by T_i, copy the after-image from the audit trail into the stable
database. {Ack} Then delete T_i from the active list.

Abort_i. Write T_i into the abort list. {Ack} Then delete it from
the active list.

Restart. For each T_i that is on the active list but not on the
commit list, process Abort_i. {Ack} For each T_i on the active list and
the commit list, process Commit_i.

In this algorithm, pages written by a transaction are not written
into the stable database until after the transaction commits. Thus,
undo is never needed, but a Restart may require redo.

This algorithm obeys the commit rule because the after-images of
pages written by T_i are stored on the audit trail before T_i commits. It
also obeys the log ahead rule, since no after-image of a transaction is
written into the stable database before it commits.

Implementations of this algorithm are described in [LAMP76,
MENA79]. This type of recovery algorithm is used in the INGRES Database
System [STON79] and in SDD-1 [BERN80b].
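
The generic redo/no-undo algorithm can be sketched in Python as follows.
This is a toy model, not an implementation from any of the cited systems:
the class name, the in-memory dictionaries standing in for stable storage,
the rule that a transaction joins the active list at its first Read or
Write, and the crash_before_install flag used to simulate a failure are
all assumptions made for illustration.

```python
class RedoNoUndoRecovery:
    def __init__(self):
        # Everything below is assumed to live on stable storage.
        self.stable_db = {}      # page -> value
        self.audit_trail = {}    # (txn, page) -> after-image
        self.active = set()
        self.commit_list = set()
        self.abort_list = set()

    def read(self, txn, page):
        self.active.add(txn)
        if (txn, page) in self.audit_trail:      # txn reads its own write
            return self.audit_trail[(txn, page)]
        return self.stable_db.get(page)

    def write(self, txn, page, value):
        self.active.add(txn)
        self.audit_trail[(txn, page)] = value    # never touches stable_db

    def commit(self, txn, crash_before_install=False):
        self.commit_list.add(txn)                # the commit point
        if crash_before_install:
            return                               # simulated system failure
        self._install(txn)

    def _install(self, txn):
        for (t, page), value in self.audit_trail.items():
            if t == txn:
                self.stable_db[page] = value     # redo is idempotent
        self.active.discard(txn)

    def abort(self, txn):
        self.abort_list.add(txn)                 # nothing to undo
        self.active.discard(txn)

    def restart(self):
        for txn in list(self.active):
            if txn in self.commit_list:
                self._install(txn)               # redo committed work
            else:
                self.abort(txn)


rm = RedoNoUndoRecovery()
rm.write("T1", "P1", "a")
rm.commit("T1", crash_before_install=True)   # fails after the commit point
rm.write("T2", "P2", "b")                    # T2 never commits
rm.restart()
```

After the restart, T1's page is in the stable database and T2 is on the
abort list without any page having been restored, which is the sense in
which Restart needs redo but never undo.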
3.3 Algorithms That Redo And Undo

In this algorithm, commands are processed as follows.

Read_i(P_j). If T_i previously wrote P_j, then copy the after-image of
P_j into a buffer. Otherwise, copy P_j from the stable database into a
buffer. {Ack}

Write_i(P_j). Copy the before-image and the after-image of P_j into
the audit trail. {Ack} Then, sometime later, write the after-image into
the stable database.

Commit_i. Write T_i into the commit list. Then, for each page written
by T_i, write the after-image into the stable database (if it hasn't
already been done). {Ack} Then, delete T_i from the active list.

Abort_i. Write T_i into the abort list. Then, for each page written
by T_i, if its after-image has already been written into the stable
database, write its before-image into the stable database. {Ack} Then
delete T_i from the active list.

Restart. For each T_i on the active list and the commit list,
process Commit_i. For each T_i on the active list but not on the commit
list, process Abort_i. {Ack}

Note that Abort may require undo and Restart may require redo.

This algorithm obeys the commit rule, since the after-image of each
page written by T_i is written into the audit trail before T_i commits.
It also obeys the log ahead rule, since the before-image of each page
written by T_i is written into the audit trail before its after-image is
written into the stable database.
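
A compact Python sketch of the generic undo/redo algorithm is given
below. It is illustrative only. The class name, the `install` flag (which
stands for "sometime later, write the after-image into the stable
database"), and the crash_after_commit_list flag are hypothetical. The
sketch also assumes that a transaction holds exclusive access to the pages
it writes until it ends, so Abort can simply restore every before-image
rather than first checking whether the after-image was installed.

```python
class UndoRedoRecovery:
    def __init__(self, pages):
        self.stable_db = dict(pages)   # survives a crash
        self.audit_trail = {}          # (txn, page) -> (before, after)
        self.active = set()
        self.commit_list = set()
        self.abort_list = set()

    def write(self, txn, page, value, install=False):
        self.active.add(txn)
        first = self.audit_trail.get((txn, page))
        # Keep the before-image from the transaction's first write.
        before = first[0] if first else self.stable_db[page]
        self.audit_trail[(txn, page)] = (before, value)   # log first
        if install:
            self.stable_db[page] = value                  # then install

    def _pages(self, txn):
        return [(p, imgs) for (t, p), imgs in self.audit_trail.items()
                if t == txn]

    def commit(self, txn, crash_after_commit_list=False):
        self.commit_list.add(txn)
        if crash_after_commit_list:
            return                                        # simulated failure
        for page, (_before, after) in self._pages(txn):
            self.stable_db[page] = after                  # redo
        self.active.discard(txn)

    def abort(self, txn):
        self.abort_list.add(txn)
        for page, (before, _after) in self._pages(txn):
            self.stable_db[page] = before                 # undo
        self.active.discard(txn)

    def restart(self):
        for txn in list(self.active):
            if txn in self.commit_list:
                self.commit(txn)
            else:
                self.abort(txn)


db = UndoRedoRecovery({"P1": "x", "P2": "y"})
db.write("T1", "P1", "x1", install=True)        # installed before commit
db.write("T2", "P2", "y2")                      # logged only
db.commit("T2", crash_after_commit_list=True)   # crash after commit point
state_at_crash = dict(db.stable_db)
db.restart()
```

At the simulated crash the stable database holds T1's uncommitted page and
lacks T2's committed one; Restart undoes the former and redoes the latter.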

One can improve the performance of this algorithm by using a
variation proposed by Gray [GRAY81]. Gray's algorithm processes commands
as follows.

Read_i(P_j). If T_i previously wrote P_j, check to see if the
after-image is in buffer storage. If not, copy P_j from the stable
database to a buffer. {Ack}

Write_i(P_j). Copy the before-image of P_j into buffer storage unless
it is already there. Write the after-image of P_j into buffer storage;
this step must not overwrite the before-image. {Ack} Sometime later,
write the before-image into the audit trail, leaving a copy of the
after-image in buffer storage. The after-image may be written into the
stable database any time after the before-image is written into the
audit trail. Once the after-image is written both to the audit trail
and the stable database, it may be removed from buffer storage.

Commit_i. After all the after-images of pages written by T_i have
been written into the audit trail, write T_i into the commit list. {Ack}

Abort_i and Restart are the same as in the generic algorithm.

This algorithm obeys the log ahead rule because the before-image of
each page is written in the audit trail before the after-image is
written in the stable database. The commit rule is also satisfied since
T_i's after-images are written into the audit trail before T_i commits.

When all after-images written by T_i have been written into the
stable database, T_i can be deleted from the active list. This tells
Restart that T_i does not need to be redone.

The main benefit of this algorithm is that the decision to write
pages into stable storage is usually left to the database system's
buffer management algorithm. The recovery algorithm writes into stable
storage only when the commit or log ahead rule requires it.

A detailed implementation of this algorithm that incorporates
checkpoints, and in which transactions write records instead of entire
pages, appears in [LIND79].


3.6 Algorithms That Don't Undo Or Redo

In the generic algorithm, each command is processed as follows.

Read_i(P_j). If T_i previously wrote P_j, then copy the after-image of
P_j into a buffer. Otherwise, copy P_j from the stable database into a
buffer. {Ack}

Write_i(P_j). Write the after-image of P_j into the audit trail.
{Ack}

Commit_i. In a single instruction, write the after-images of all
pages written by T_i into the stable database and delete T_i from the
active list. {Ack}

Abort_i. Write T_i into the abort list. {Ack} Then delete it from
the active list.

Restart. For each T_i on the active list, process Abort_i. {Ack}

Unfortunately, this description isn't very informative because it
relies on a magical instruction that implements commit without even
using a commit list. Notice that if the magical instruction is
available, then undo isn't needed because a transaction's after-images
are not written into the stable database before it commits, and redo
isn't needed because a transaction's after-images are written into the
stable database in the instruction that commits the transaction.

We will describe an implementation of the Commit instruction
similar to one presented in [LORI77].

Lorie's Shadow Page Algorithm

Assume that the stable database is partitioned into files
{F_1, ..., F_n}, each of which is a sequence of logical pages. Each file,
F_j, has a page table, PT_j, whose entries point to the pages of F_j. That
is, PT_j[k] contains the address of the k-th page of F_j; this page is
denoted P_jk. Assume that each page table fits on one page in the stable
database. The stable database also contains in a fixed address a master
record, M, that points to the n page tables; M[j] contains the address
of PT_j.

Abort and Restart are processed as in the generic algorithm. Read,
Write, and Commit are processed as follows.

For each file, F_j, the first Read or Write that T_i issues on a page
of F_j causes the recovery algorithm to make a copy of PT_j in buffer
storage, denoted PT_j^i. For each page P_jk that T_i writes, PT_j^i[k] will
point to the after-image of that page in the audit trail. (The other
entries in PT_j^i are irrelevant.)

Read_i(P_jk). If T_i previously wrote P_jk, then copy the after-image
of P_jk from address PT_j^i[k] into a buffer. Otherwise, use M to find
PT_j and copy P_jk from address PT_j[k] in the stable database into a
buffer. {Ack}

Write_i(P_jk). Write the new copy of P_jk into the audit trail. Then
assign PT_j^i[k] the address of that audit trail page. {Ack}

Commit_i. Copy M into buffer storage. For each file F_j that T_i
wrote into, use (the buffer copy of) M to find PT_j and copy it into an
empty page of buffer storage. (There are now two page tables for F_j
connected to T_i: the buffer copy of PT_j that was just read, and PT_j^i.)
For each page P_jk that was written by T_i, assign to the buffer copy of
PT_j[k] the contents of PT_j^i[k]. Then, write PT_j into a new location in
scratch space; denote this new copy of PT_j by PT'_j. Then, for each F_j
that T_i wrote into, assign to (the buffer copy of) M[j] the address of
PT'_j. Then write M back to its fixed address in stable storage. {Ack}

The commit algorithm prepares a scratch copy of the page table
(PT'_j) and installs it by assigning to M[j] the address of PT'_j for
each file F_j that T_i wrote. By writing M back to the stable database,
the old copies of the page tables (PT_j) are replaced by the new ones
(PT'_j).

The instruction that commits T_i is the one that writes the updated
M back into the stable database. Before this write, any read will use
the old copy of M to read the before-image of any page written by T_i.
After this write, it will read the after-image of any such page.

The recovery algorithm can commit only one transaction at a time.
That is, Commit is a critical section. If two transactions were
(incorrectly) to commit concurrently, each transaction might read a copy
of PT_j into buffer storage, change the pointers to pages it wrote, and
write that copy of PT_j to the audit trail. Thus, two copies of PT_j
would exist. Whichever transaction updated M first would lose its
updates to PT_j, since they would be overwritten by the second
transaction when it installed its copy of PT_j by updating M.

A version of Lorie's algorithm is implemented in System R's
recovery manager [GRAY81].
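
The shadow page scheme can be modeled with a short Python sketch. The
sketch is an assumption-laden toy: ShadowPagedStore, its `disk`
dictionary, and the choice of address 0 for the master record are
hypothetical, and a transaction's private table PT_j^i is represented by a
dictionary that holds only the entries the transaction wrote (the text
notes the other entries are irrelevant). Concurrency control and the
critical section around Commit are not modeled.

```python
class ShadowPagedStore:
    """Toy stable store: `disk` maps addresses to contents; address 0 is M."""

    def __init__(self, files):
        # files: {file name: list of page values}
        self.disk = {}
        self.next_addr = 1
        master = {}
        for name, pages in files.items():
            table = [self._alloc(value) for value in pages]   # PT_j
            master[name] = self._alloc(table)
        self.disk[0] = master          # master record M at a fixed address

    def _alloc(self, content):
        addr = self.next_addr
        self.next_addr += 1
        self.disk[addr] = content
        return addr

    def read(self, name, k, private=None):
        # private: entries of the transaction's own tables, {(file, k): addr}
        if private and (name, k) in private:
            return self.disk[private[(name, k)]]
        table = self.disk[self.disk[0][name]]
        return self.disk[table[k]]

    def write(self, name, k, value, private):
        # The after-image goes to a fresh page; the old page is untouched.
        private[(name, k)] = self._alloc(value)

    def commit(self, private):
        master = dict(self.disk[0])                  # buffer copy of M
        for name in {f for f, _ in private}:
            table = list(self.disk[master[name]])    # buffer copy of PT_j
            for (f, k), addr in private.items():
                if f == name:
                    table[k] = addr
            master[name] = self._alloc(table)        # PT'_j, new location
        self.disk[0] = master                        # the write that commits


store = ShadowPagedStore({"F1": ["a", "b"]})
t1_tables = {}
store.write("F1", 1, "B", t1_tables)
seen_by_others = store.read("F1", 1)          # before-image
seen_by_t1 = store.read("F1", 1, t1_tables)   # T1's own after-image
store.commit(t1_tables)
seen_after_commit = store.read("F1", 1)
```

Abort needs no work on the store in this model: discarding the private
dictionary leaves M and the old page tables exactly as they were.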


3.7 Recovery In A Distributed Database System

A distributed database system (DDBS) consists of a set of sites
connected by a network. Each transaction can read or write data stored
at any of the sites.

We model a DDBS by a set of processes called data modules (DMs) and
transaction modules (TMs). A DM is a centralized database system as
defined in Section 3.2. It processes Reads and Writes on pages stored
at that DM. It also processes Commits and Aborts, which permanently
install or undo the writes of a transaction at that DM.

A TM interfaces transactions and DMs. Each transaction, T_i,
submits commands to one TM, say TM_a. To process Read_i or Write_i, TM_a
simply sends the command to the DM that stores the data being read or
written. Let Active_i be the DMs at which T_i was active. To process
Abort_i, TM_a must ensure that every DM in Active_i processes Abort_i. To
process Commit_i, TM_a should try to ensure that every DM in Active_i
processes Commit_i.

Unfortunately, TMs and DMs may fail at unpredictable times. TMs
must process commands so that such failures never cause the system to
produce incorrect results.

We assume that process (i.e., TM and DM) failures are 'clean'. If
a process does not produce an expected response to a message within a
timeout period, then the process has really failed. If one process
believes another process is down, then all processes believe that the
process is down. And, when a process recovers, it recognizes that it
has just recovered from a failure and runs a special 'reintegration
protocol'. Mechanisms that support these assumptions are beyond the scope
of this section. (See [ATTA82, HAMM80, P ARIA 2, WALT82].)
Each TM keeps an active list, commit list, and abort list in stable
storage. And, for each T_i on the active list, it maintains Active_i in
stable storage. When it receives a Read or Write from T_i, it sends the
command to the appropriate DM and adds that DM to Active_i. For the
first such Read or Write, the TM also adds T_i to its active list. It
processes Abort and Commit as follows.

Abort_i. Add T_i to the abort list. Then, send Abort_i to each DM in
Active_i. Wait for every DM to acknowledge that it processed Abort_i.
{Ack} Delete T_i from the active list.

Commit_i. Add T_i to the commit list. Then, send Commit_i to each DM
in Active_i. Wait for every DM to acknowledge that it processed Commit_i.
{Ack} Delete T_i from the active list.

If a TM fails and later restarts, then it processes a Restart in
the usual way. For each T_i on both the commit list and the active list,
process Commit_i. For all other T_i on the active list, process Abort_i.

If a TM, say TM_a, discovers that a DM, say DM_b, has failed, then it
normally processes Abort_i for each T_i that has DM_b in Active_i. But what
if DM_b is in Active_i and TM_a has already sent Commit_i to other DMs in
Active_i? In this case it can't abort T_i because T_i already may be
committed at some DMs. Instead, it must wait for DM_b to recover. When it
does, TM_a sends Commit_i to DM_b, too.


3.8  Two-Phase  Commit 

Each TM must obey the commit rule. That is, it must not send
Commit_i to any DM until every DM in Active_i has T_i's after-images on
stable storage. Otherwise, a DM in Active_i may:

1. Fail before receiving Commit_i

2. Upon recovering, discover from TM_a that T_i has committed

3. But be unable to process Commit_i because it lost some of T_i's
after-images due to the failure
To obey the commit rule and thereby prevent (3), TM_a can use the
two-phase commit protocol for processing Commit commands [LAMP76].
Phase one begins when TM_a receives Commit_i. It then sends a command
called End_i to each DM in Active_i. A DM processes End_i by first
ensuring that T_i's after-images at that DM are on stable storage and then
sending an acknowledgement to TM_a. When TM_a has received the
acknowledgement from every DM in Active_i, phase one is done. {Ack} In
phase two, TM_a sends Commit_i to each DM.

Since TM_a does not send Commit_i to any DM until every DM has
acknowledged End_i, no DM in Active_i will process Commit_i until every DM
has T_i's after-images on stable storage.

If a DM, say DM_b, fails before acknowledging End_i, then TM_a won't
leave phase one. Since TM_a cannot be sure that DM_b will be able to
process Commit_i when it recovers, TM_a must either wait for DM_b to
recover or abort T_i by sending Abort_i to every DM in Active_i. In
practice, TM_a simply waits a prespecified timeout period after
distributing the End_i's; if it hasn't received an acknowledgement of some
End_i by this time, it assumes the DM has failed and aborts T_i.

Until a DM processes End_i, it may unilaterally decide to abort T_i
by sending an Abort_i command to TM_a. Once a DM acknowledges End_i, it
loses its right to unilaterally abort T_i, and may only abort T_i if
directed to do so by TM_a.
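
The message flow of two-phase commit can be sketched in Python as shown
below. The sketch makes several simplifying assumptions that are not in
the text: the classes DataModule and TransactionModule are hypothetical,
messages are ordinary method calls, a DM that is down answers End with
False (standing in for the timeout), and the TM records the transaction in
its commit list at the end of phase one.

```python
class DataModule:
    def __init__(self, name, up=True):
        self.name = name
        self.up = up
        self.prepared = set()
        self.committed = set()
        self.aborted = set()

    def end(self, txn):
        """Phase one: put after-images on stable storage, then acknowledge."""
        if not self.up:
            return False              # no acknowledgement before the timeout
        self.prepared.add(txn)
        return True

    def commit(self, txn):
        self.committed.add(txn)

    def abort(self, txn):
        self.prepared.discard(txn)
        self.aborted.add(txn)


class TransactionModule:
    def __init__(self):
        self.commit_list = set()
        self.abort_list = set()

    def commit(self, txn, active_dms):
        # Phase one: send End to every DM (the list forces all sends).
        acks = [dm.end(txn) for dm in active_dms]
        if all(acks):
            self.commit_list.add(txn)
            for dm in active_dms:     # phase two
                dm.commit(txn)
            return "committed"
        # Some DM did not acknowledge End: abort the transaction.
        self.abort_list.add(txn)
        for dm in active_dms:
            if dm.up:
                dm.abort(txn)
        return "aborted"


dm_a, dm_b = DataModule("a"), DataModule("b")
dm_c = DataModule("c", up=False)
tm = TransactionModule()
result_1 = tm.commit("T1", [dm_a, dm_b])   # both acknowledge End
result_2 = tm.commit("T2", [dm_a, dm_c])   # dm_c never acknowledges
```

No DM ever sees Commit for T2, because the Commit messages are sent only
after every acknowledgement has arrived.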


3.9 Three-Phase Commit

The TM algorithm presented above has a serious disadvantage.
Suppose TM_a sends End_i to DM_b, DM_b acknowledges End_i, and then TM_a
fails. Since DM_b doesn't know whether T_i will commit or abort, it has to
wait for TM_a to recover. In particular, it must hold T_i's locks until
TM_a recovers. If TM_a is supervising many active transactions, large
portions of the database may be locked and unavailable until TM_a
recovers.

We can avoid this problem by providing each TM with one or more
backup TMs. If a TM fails, the backups can take over its functions.
One such algorithm is three-phase commit [SKEE81a, SKEE82a,
SKEE82b, SKEE81b]. Each backup for TM_a maintains a commit list, CL_a.
To process Commit_i, TM_a behaves as follows.

1. TM_a sends End_i to each DM in Active_i. Then it waits for all DMs to
acknowledge their End_i's.

2. TM_a sends a command called Precommit_i to each backup TM. A TM
processes Precommit_i by adding T_i to its copy of CL_a, and then
sending an acknowledgement to TM_a. TM_a waits for all backups to
acknowledge Precommit_i.

3. TM_a sends Commit_i to each DM in Active_i.

Essentially, this is the two-phase commit protocol with a new phase
added (Step 2).

If a backup TM fails, TM_a can ignore the failure if the number of
backups is still acceptably large; otherwise, it should acquire another
backup TM to replace the failed one.

Suppose TM_a fails. When the backups discover the failure, they
elect one of their member TMs, say TM_b, to replace TM_a. After TM_b is
elected, every other backup TM sends its copy of CL_a to TM_b. TM_b takes
the union of those copies and distributes the result to other backups.
This becomes everyone's copy of CL_a. When this process is complete, TM_b
tells all DMs that it has taken over TM_a's functions.

If a DM wants to know what happened to a particular transaction,
T_i, that was supervised by TM_a, it asks TM_b. If T_i is in TM_b's CL_a,
then TM_b tells the DM to commit T_i; otherwise, it tells the DM to abort
T_i. Thus, a transaction that was supervised by TM_a is committed if and
only if it reached the second phase of three-phase commit and at least
one of its precommits reached a backup TM (that didn't fail).

The algorithm for electing a backup TM to replace TM_a is easy, as
long as none of the backups fail or recover from failure during the
election. Assume each TM has a unique identifier. To elect a
replacement for TM_a, each backup exchanges its identifier with every
other backup. The TM with the largest identifier wins the election and
takes over.
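
The takeover procedure just described (elect the backup with the largest
identifier, merge the backups' commit lists, then answer DM inquiries)
can be sketched in Python. The function names and the use of integers for
TM identifiers are assumptions, and the sketch covers only the easy case
in which no backup fails or recovers during the election.

```python
def elect(backup_ids):
    """Election rule from the text: the largest identifier wins."""
    return max(backup_ids)


def take_over(commit_lists):
    """commit_lists maps each backup's identifier to its copy of CL_a."""
    leader = elect(commit_lists)
    merged = set().union(*commit_lists.values())     # union of all copies
    # The merged list becomes every backup's copy of CL_a.
    return leader, {backup: set(merged) for backup in commit_lists}


def outcome(txn, cl_a):
    """What the new TM tells a DM that asks about a transaction."""
    return "commit" if txn in cl_a else "abort"


# Three backups; T2's Precommit reached only backup 5 before the failure.
leader, lists = take_over({3: {"T1"}, 7: set(), 5: {"T1", "T2"}})
t2_fate = outcome("T2", lists[leader])
t9_fate = outcome("T9", lists[leader])
```

T2 is committed even though the elected backup had not recorded it
itself, because at least one surviving backup had received its Precommit.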

If backup TMs fail or recover from failure during the election, the
above algorithm can misbehave. Each of two TMs can conclude that it won
the election. Algorithms to prevent this behavior are discussed in
[GARC82, SKEE81a].

It is possible that TM_a and all of its backups fail during a short
time period, too short for replacement backups to be acquired. This
is called a total failure of TM_a; no TM can ever take over its function.
DMs must wait until TM_a and enough of its backups have recovered so that
the correct status of TM_a's transactions can be determined. Algorithms
for recovering after total failure are discussed in [SKEE81].


Many variations on three-phase commit protocols have been proposed
and analyzed. See [ALSB78, ALBS76, COOP82, EAGE81, HAMM80, LAMP78,
MENA78, TRAI82].


3.10 Replicated Data

If a DM fails, transactions that need the failed DM's data must
wait for the DM to recover. To avoid this delay, the DDBS can replicate
data; that is, it can store parts of the database at more than one DM.
If one copy is unavailable due to a DM failure, other copies can be used
instead.

Many concurrency control algorithms are known for keeping multiple
copies of each page mutually consistent. However, even if concurrency
control is performed correctly, failures can cause transactions to
malfunction.

For example, suppose P_1 has copies P_1a and P_1b at DM_a and DM_b
(resp.), and P_2 has copies P_2c and P_2d at DM_c and DM_d. T_1 reads P_1 and
writes P_2; T_2 reads P_2 and writes P_1. Replicated data is handled by
the 'intuitive' algorithm: to read data, read any copy; to write data,
write all available copies. The following execution obeys these rules,
yet it is incorrect.

Read_1(P_1a)

Read_2(P_2d)

DM_d fails

Write_1(P_2c)

DM_a fails

Write_2(P_1b)

This execution is incorrect because T_1 reads (a copy of) P_1 before T_2
writes P_1, while T_2 reads (a copy of) P_2 before T_1 writes P_2. The first
condition means that T_1 appears to precede T_2, while the second
condition means that T_2 appears to precede T_1. These conditions cannot
both hold in a serial execution, and so the given execution is incorrect.
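
The argument above amounts to finding a cycle in the "appears to precede"
relation. The Python sketch below checks this mechanically for a
two-transaction execution of that shape. It is an illustration only: the
function names are hypothetical, event times are arbitrary integers, and
only read-before-write conflicts on the same logical page are considered,
since those are the only conflicts the argument uses.

```python
def precedence_edges(reads, writes):
    """reads, writes: lists of (time, transaction, logical page).

    Returns the set of pairs (earlier, later) implied by a transaction
    reading a page before another transaction writes it.
    """
    edges = set()
    for r_time, reader, r_page in reads:
        for w_time, writer, w_page in writes:
            if r_page == w_page and reader != writer and r_time < w_time:
                edges.add((reader, writer))
    return edges


def has_cycle(edges):
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)

    def reachable(start, goal, seen):
        for nxt in graph.get(start, ()):
            if nxt == goal:
                return True
            if nxt not in seen and reachable(nxt, goal, seen | {nxt}):
                return True
        return False

    return any(reachable(node, node, {node}) for node in graph)


# T1 reads P1 and later writes P2; T2 reads P2 and later writes P1.
reads = [(1, "T1", "P1"), (2, "T2", "P2")]
writes = [(4, "T1", "P2"), (6, "T2", "P1")]
edges = precedence_edges(reads, writes)
not_serializable = has_cycle(edges)
```

The two edges point in opposite directions, so no serial order of T1 and
T2 is consistent with what each transaction read.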

Algorithms for correctly processing commands on replicated data in
the presence of DM failures appear in [ALSB78, ALBS76, COOP82, EAGE81,
GIFF79, HAMM80, MENA78, THOM79]. No consensus on the best approach to
this problem has yet emerged.
4. Performance of Distributed Concurrency Control

Many factors affect the performance of a distributed concurrency
control algorithm:

1. IO delay,

2. communication delay,

3. ratio of read-only to write transactions,

4. database size, transaction size,

5. system multiprogramming level,

6. distribution and replication of the database,

7. overhead of deadlock detection,

8. and system load, defined as the product of transaction size and
multiprogramming level divided by the database size.

Our simulation study of the performance of distributed concurrency
control algorithms shows that four of these factors have more significant
impact than the others: IO delay, communication delay, transaction size,
and system load. Hence we divide our simulation results into groups and
discuss them separately by classifying the system environment as either
IO bound or communication bound, and as either short transaction loaded
or long transaction loaded. We consider a system to be IO bound if
queuing for IO or CPU resources is a more significant problem than
queuing for communication channels; and we consider a system to be
communication bound if queuing for communication channels is a more
significant problem than queuing for IO and CPU resources. We consider a
system to be short transaction loaded if the average number of data
items requested by the transactions (or transaction size) is less than
0.05% of the database. The system is long transaction loaded if the
average is larger than 0.2% of the database. If the average is between
0.05% and 0.2% of the database, the classification of the system as
short transaction loaded or long transaction loaded depends on the
system load. Details of the classification can be found in Figure 4.1.


Thus we present four categories of system environments: short
transaction loaded and IO bound (SIO), short transaction loaded and
communication bound (SCM), long transaction loaded and IO bound (LIO),
and long transaction loaded and communication bound (LCM). For each of
these four environments, we compare the performance of various
concurrency control algorithms, taking into consideration the factors
that are not used to classify the system environment, i.e.,
multiprogramming level, ratio of read-only to write transactions, and
distribution and replication of the database.

                         System Load
    Trans Size        < 10%       > 10%

    < 0.05%           Short       Short
    0.05% - 0.2%      Short       Long
    > 0.2%            Long        Long

    Trans Size:     Average number of data items requested by a
                    transaction as a percentage of the database size.
    System Load:    Trans Size multiplied by the multiprogramming
                    level.
    Database size:  Total number of data items in the database.

              Figure 4.1 System Classification
               (Short Loaded or Long Loaded)

We first describe, in Section 4.1, the distributed DBMS model that
we use to evaluate these algorithms. We then define and describe, in
Section 4.2, the concurrency control algorithms that we evaluate. We
compare these algorithms in Sections 4.3.1 through 4.3.4 for each of the
four environments. In Section 4.4 we summarize the results of Section
4. Details of the simulation results can be found in the Appendix.

To use this section as a design guide, a system designer must first
classify his system environment, using the following three parameters.
First, he must decide whether his system environment is IO bound or
communication bound. Second, he must estimate the average number of data
items, as a percentage of the total number of data items in the
database, requested by a transaction (transaction size). Third, he must
estimate the average system load, which is the product of the
transaction size and the multiprogramming level of the system (number of
transactions running concurrently). Using these three parameters and
Figure 4.1, the designer can find his system classification. For each
classification, he can find the comparison of various distributed
concurrency control algorithms in Section 4.3.1 through Section 4.3.4.
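
The classification procedure can be written down as a small Python
function. This is a sketch of Figure 4.1 under stated assumptions: the
function names are hypothetical, transaction size is given as a
percentage of the database, and a system load of exactly 10% is treated
as "long", a boundary case the figure does not specify.

```python
def system_load(trans_size_pct, multiprogramming_level):
    """System load: transaction size (% of database) times the MPL."""
    return trans_size_pct * multiprogramming_level


def classify(trans_size_pct, multiprogramming_level, io_bound):
    """Return one of 'SIO', 'SCM', 'LIO', 'LCM'."""
    load = system_load(trans_size_pct, multiprogramming_level)
    if trans_size_pct < 0.05:
        length = "S"
    elif trans_size_pct > 0.2:
        length = "L"
    else:
        # Between 0.05% and 0.2%: the system load decides.
        length = "S" if load < 10 else "L"
    return length + ("IO" if io_bound else "CM")


small_txns = classify(0.01, 50, io_bound=True)
mid_light = classify(0.1, 50, io_bound=False)    # load is about 5%
mid_heavy = classify(0.1, 200, io_bound=True)    # load is about 20%
large_txns = classify(0.5, 2, io_bound=False)
```

The two mid-sized cases differ only in multiprogramming level, which is
what moves a system from the short loaded to the long loaded column.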
4.1 Performance Model

We assume that there are two kinds of transactions: read-only
transactions and write transactions (update transactions). Write
transactions always read what they write, and write what they read. This
assumption may seem restrictive, but it is a good approximation of real
applications. Our earlier simulation results [LIN81a] showed that the
total number of requests and the ratio of read-only requests to write
requests active at any moment in the system have much greater impact on
the system performance than the ratio of read-only to write
transactions. Moreover, our analysis shows that a more general assumption
about transactions would not favor any concurrency control algorithm;
thus for performance comparison of the algorithms, this assumption does
not distort the results. To use the results of this section to evaluate
the performance of a system that has transactions reading more than
writing, the ratio of read-only to write transactions in the system can
be adjusted upward.

A  read-only  transaction  consists  of  a  sequence  of  read-only 
requests,  and  each  request  reads  a  data  item.  A  write  transaction  con¬ 
sists  of  a  sequence  of  write  requests  (update  requests),  followed  by  a 
two-phase  commit.  Requests  from  a  transaction  are  processed  sequen¬ 
tially;  another  request  is  initiated  only  after  the  previous  one  has 
been  successfully  processed. 

As  described  in  Section  2,  a  distributed  DBMS  consists  of  IMs, 
schedulers,  and  DMs.  Each  transaction  is  managed  by  a  TM,  which 
sequences  its  requests  and  sends  them  to  the  appropriate  scheduler  to  be 
processed.  If  the  scheduler  site  is  different  from  the  TM  site,  a  com¬ 
munication  delay  is  incurred. 

If a request is read-only, the scheduler requests a read lock for
the requested data item (assuming that a two-phase locking algorithm is
used). Depending on the particular concurrency control algorithm used,
some lock managers may grant the lock without checking whether the
request conflicts with another transaction. Other lock managers may
check for the conflict. If a conflict is found, the read-only request
waits and incurs a blocking delay. Depending on the concurrency control
algorithm used, the scheduler may initiate deadlock detection when
blocking occurs, thus incurring processing and possibly communication
overhead. When the lock for the requested data item is obtained, the
scheduler sends the read-only request to the appropriate DM, and the
read-only request incurs a processing delay. A read-only transaction
ends after all its requests have been successfully processed.

A  write  request  is  processed  in  a  manner  similar  to  a  read  request, 
except  that  successful  processing  of  all  write  requests  of  a  transaction 
is  always  followed  by  a  two-phase  commit,  and  a  write  transaction  ends 
after  the  two-phase  commit  is  successfully  processed  (two-phase  commit 
is the only reliability algorithm that we use in our simulation of
concurrency control algorithms).
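
The lock handling described above can be sketched as a small lock table. This is an illustrative sketch only, not the simulator used in this study: the names `LockTable`, `request`, and `release_all` are hypothetical, and the sketch assumes the usual compatibility rule in which read locks are shared and write locks are exclusive.

```python
from collections import defaultdict


class LockTable:
    """Toy lock table for one scheduler (hypothetical, for illustration)."""

    def __init__(self):
        # item -> {transaction id: "r" or "w"}
        self.holders = defaultdict(dict)

    def request(self, txn, item, mode):
        """Return True if the lock is granted, False if the request must block."""
        others = {t: m for t, m in self.holders[item].items() if t != txn}
        if mode == "r":
            # A read lock conflicts only with another transaction's write lock.
            conflict = any(m == "w" for m in others.values())
        else:
            # A write lock conflicts with any lock held by another transaction.
            conflict = bool(others)
        if conflict:
            return False
        # Keep an existing write lock rather than downgrading it.
        if self.holders[item].get(txn) != "w":
            self.holders[item][txn] = mode
        return True

    def release_all(self, txn):
        """Release every lock held by txn (for example after commit or abort)."""
        for locks in self.holders.values():
            locks.pop(txn, None)
```

A request that returns False corresponds to the blocking delay in the text; what happens next (waiting, deadlock detection, or abortion) depends on the algorithm.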

If  timestamp  based  algorithms  are  used,  a  timestamp  is  assigned  to 
each  transaction,  and  requests  from  the  transaction  inherit  the  transac¬ 
tion  timestamp.  Each  data  item  also  has  read  and  write  timestamps  that 
record the timestamps of the transactions that last read from (or wrote
into)  the  data  item.  For  all  the  timestamp  algorithms  that  we  have 
evaluated,  the  scheduler  always  resides  at  the  site  of  a  DM,  and  a 
request  is  always  sent  to  the  scheduler  at  the  site  where  the  data  is  to 
be  accessed.  When  a  scheduler  receives  a  request,  it  compares  the 
timestamp  of  the  request  with  the  read  and  write  timestamp(s)  of  the 
data  item,  and  it  may  or  may  not  delay  the  request,  depending  on  the 
particular  algorithm  used.  If  the  request  is  not  blocked,  it  is  sent  to 
the  DM  at  the  scheduler  site,  and  the  request  incurs  a  processing  delay. 

We simulate both IO bound and communication bound system
environments. In the IO bound environment, we explicitly simulate
queuing for local processing, which combines CPU and IO processing. We
differentiate between local processing of simple messages, such as lock
request, lock release, and deadlock detection, and local processing of
data requests. The latter needs more processing time than the former.
In the IO bound environment, we do not simulate queuing for
communication channels. Communication delay is simply simulated by a
delay drawn from a probability distribution.
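
The IO bound model can be illustrated with a single-server FIFO queue for local processing and an unqueued random communication delay. This is only a sketch under stated assumptions: the report does not name the distribution or the service times, so the exponential distribution and the constants `MESSAGE_TIME` and `DATA_TIME` are hypothetical choices.

```python
import random

MESSAGE_TIME = 1.0  # assumed cost of a simple message (lock request, release)
DATA_TIME = 4.0     # assumed cost of a data request (larger, as in the text)


def fifo_completion_times(arrivals, service_times):
    """Single-server FIFO queue: return the completion time of each job.

    arrivals: arrival times in nondecreasing order
    service_times: local processing time of each job, in the same order
    """
    completions = []
    server_free_at = 0.0
    for arrive, service in zip(arrivals, service_times):
        start = max(arrive, server_free_at)  # wait while the server is busy
        server_free_at = start + service
        completions.append(server_free_at)
    return completions


def communication_delay(rng, mean):
    """Draw one communication delay; no queuing for channels is modeled.

    The exponential distribution is an assumption made for this sketch.
    """
    return rng.expovariate(1.0 / mean)
```

For example, a data request and a lock message that both arrive at time 0 finish at times 4 and 5, because the message queues behind the data request.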


In  the  communication  bound  environment,  we  explicitly  simulate 
queuing  for  communication  channels,  but  not  for  local  processing 
resources.  In  some  cases,  we  differentiate  between  message  and  data 
transmission.  The  latter  takes  longer  than  the  former.  We  simulate 
local delay (combining IO and CPU processing) by drawing a random number
from a probability distribution.

The performance parameters that we use to compare distributed
concurrency control algorithms include read throughput, write throughput,
average read response time, and average write response time. Read
throughput is the number of read-only requests successfully completed
per time unit; read-only requests processed and subsequently aborted are
not included. Write throughput is similarly defined. Read response
time is measured from the time a read-only request is initiated by a TM
to the time when the next read-only request of the same transaction is
initiated by the same TM. Thus, it may include communication delay,
blocking delay, and processing delay. Average read response time is the
mean of the response times of all successfully completed read-only
requests. Average write response time is similarly computed.
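
The definitions above translate directly into a small computation. The sketch below is illustrative; the record layout (`RequestRecord`) and function names are hypothetical and are not taken from the simulator.

```python
from dataclasses import dataclass


@dataclass
class RequestRecord:
    kind: str        # "read" or "write"
    start: float     # time the TM initiated the request
    end: float       # time the TM initiated the next request of the transaction
    committed: bool  # False if the request was processed and later aborted


def throughput(records, kind, elapsed):
    """Successfully completed requests of one kind per time unit."""
    done = [r for r in records if r.kind == kind and r.committed]
    return len(done) / elapsed


def average_response_time(records, kind):
    """Mean response time over successfully completed requests of one kind."""
    times = [r.end - r.start for r in records if r.kind == kind and r.committed]
    return sum(times) / len(times) if times else 0.0
```

Aborted requests are excluded from both measures, as in the definitions above.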

In  addition  to  blocking  delay,  communication  delay,  and  processing 
delay,  other  factors  also  affect  average  response  times  and  throughputs 
(e.g.,  transaction  abortion,  deadlock  detection,  and  multiple  versions 
of  data).  The  concurrency  control  algorithms  evaluated  in  this  section 
can  be  differentiated  by  the  way  they  trade  off  these  factors.  Some 
algorithms accept longer blocking delay in exchange for fewer transaction
abortions, and others do the reverse. Some accept more communication
delay in exchange for less blocking delay, and others do the reverse. We
describe these algorithms in the next section. In Section 4.3, we
compare and rank these algorithms based on total throughput. Detailed
data on the performance parameters can be found in the Appendix.


4.2 Description of Algorithms

The algorithms that we will consider are listed below. Selection
of these algorithms is based on our earlier heuristic evaluation
reported in [BERN81a]. The selected algorithms were shown to perform
better than the algorithms discarded. Names of some algorithms are
linked by the conjunction 'and' (e.g., Primary Site and Primary Site).
The term before the conjunction describes the method used for read
requests, and the term after the conjunction describes the method used
for write requests. These algorithms are described briefly in this
section and summarized in Figure 4.2. Details of these algorithms can be
found in the references.

1.  Primary  Site  and  Primary  Site  Two  Phase  Locking  (C-C) 

2. Primary Copy and Primary Copy Two Phase Locking (P-P)

3. Basic and Basic Two Phase Locking (B-B)

4. Basic and Primary Copy Two Phase Locking (B-P)

5. Basic and Primary Site Two Phase Locking (B-C)

6. DDM Multiple Version and Optimistic Two Phase Locking (DDM)

7.  Basic  and  Optimistic  Two  Phase  Locking  (Opm) 

8.  Majority  Consensus  Timestamp  (Maj) 

9.  Wait-Die  Two  Phase  Locking  (Die) 

10.  Basic  Timestamp  (BaT) 

11.  Multiple  Version  Timestamp  (MvT) 

12.  Dynamic  Timestamp  (Dyn) 

The SDD-1 algorithm is not explicitly covered because the Dynamic
Timestamp algorithm is an improved version of it [LIN79, LIN81].
Neither is the Conservative Timestamp algorithm covered, because this
algorithm essentially executes transactions serially in timestamp order.
Thus it can perform better than other algorithms only when the
transaction size is very large and the system load is extremely heavy,
so that concurrent execution of transactions becomes counterproductive.

The  Primary  Site  and  Primary  Site  method  is  essentially  a  central¬ 
ized  two-phase  locking  method.  All  requests  for  read  locks  and  write 
locks  are  sent  to  and  processed  by  a  designated  primary  site,  which  may 
use backup sites to improve resiliency. This method accepts more
transaction blocking in exchange for fewer transaction abortions, and it
checks for lock conflict as early as possible. It detects deadlock as
early as possible, and it avoids distributed deadlock detection, but it
has a bottleneck at the primary site.

The Primary Copy and Primary Copy method is a generalized version
of the Primary Site and Primary Site method. All requests for read
locks and write locks are sent to and processed by a designated primary
copy site. However, primary copy sites for different data items may be
different, so distributed deadlock may occur. This method also accepts
more transaction blocking in exchange for fewer transaction abortions,
and it checks for lock conflict as early as possible. It requires
distributed deadlock detection, but it may delay deadlock detection to
reduce communication overhead.

The  Basic  and  Basic  method  sets  read  locks  and  reads  data  locally 
if  a  local  copy  is  available;  otherwise  it  locks  and  reads  the  closest 
copy.  It  sets  write  locks  globally.  For  each  update  request,  an  update 
lock  is  requested  from  all  copies,  and  the  update  request  is  granted 
only after locks from all copies are obtained. This method gains faster
read-only transaction response time at the cost of slower write
transaction response time. It also accepts more transaction blocking in
exchange for fewer transaction abortions. It checks for lock conflict
and deadlock as early as possible, at the expense of more communication
overhead.
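
The read-locally, write-everywhere rule of the Basic and Basic method can be sketched as a function that picks the sites at which a lock is requested. The function name `sites_to_lock` and the `distance` callback are hypothetical; how the closest copy is determined is an assumption of this sketch.

```python
def sites_to_lock(mode, local_site, copies, distance):
    """Return the sites at which a lock must be obtained.

    mode: "r" for a read lock, "w" for a write (update) lock
    copies: the sites that hold a copy of the data item
    distance: function (site_a, site_b) -> cost, used to find the closest copy
    """
    if mode == "w":
        return sorted(copies)  # write locks are requested from all copies
    if local_site in copies:
        return [local_site]    # read locally when a local copy exists
    return [min(copies, key=lambda s: distance(local_site, s))]
```

A read therefore touches one site, while an update touches every copy, which is where the extra communication overhead of this method comes from.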

The  Basic  and  Primary  Copy  method  processes  read  requests  as  the 
previous  method  does,  but  it  requests  write  locks  only  from  a  designated 
primary copy. This method checks for most lock conflicts as soon as
possible, but it may delay distributed deadlock detection to reduce
communication overhead. This method also accepts more transaction
blocking in exchange for fewer transaction abortions.

The  Basic  and  Primary  Site  method  is  similar  to  the  last  method 
except  that  update  lock  requests  are  sent  to  a  central  site  instead  of 
to several primary copy sites. Thus deadlock detection is more
centralized than in the previous method, and overhead is concentrated at
the primary site.


The DDM [CHAN82a, CHAN82b] method avoids conflict between read
requests and update requests by keeping several versions of each data
item. For each update request, DDM locks locally if a local copy
exists; otherwise it locks the closest copy. The update lock is
propagated to other copies at transaction end. Detection of most
conflicts among update requests is delayed until transaction end. Thus
blocking delay is minimized for most write transactions at the expense
of more transaction abortions at transaction end.

The  Basic  and  Optimistic  method  sets  read  and  update  locks  locally, 
if  a  local  copy  exists;  otherwise  it  locks  the  closest  copy.  The  update 
lock  is  propagated  to  all  copies  when  the  transaction  that  holds  the 
update lock ends. Thus, distributed lock conflict checking and deadlock
detection are delayed until a transaction ends. This algorithm reduces
transaction  blocking  delay  at  the  expense  of  more  transaction  abortions. 

The Majority Consensus algorithm is similar to the Basic Optimistic
algorithm. Each transaction has two phases: a read phase and a commit
phase. During the read phase, a transaction reads locally if a local
copy exists; otherwise it reads the closest copy. Timestamps of data
items read by a transaction are recorded. During the commit phase,
both read-only and update transactions must be certified by comparing
the timestamps of the data read by each transaction to the transaction
timestamp. Because of the certification step, read-only transactions
require more communication overhead in this algorithm than in the Basic
Optimistic algorithm. The details of the algorithm can be found in
[BERN81a, THOM79]. If the algorithm is modified to favor read-only
transactions so that read-only transactions need no certification, then
it requires no more communication overhead than the Basic Optimistic
algorithm. This algorithm checks for lock conflicts as late as possible,
and it accepts more transaction abortions in exchange for less
transaction blocking.
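
Certification at commit time can be illustrated with a much simplified voting sketch. It assumes that a site votes yes when every item in the read set still carries the timestamp recorded during the read phase, and that a strict majority of yes votes certifies the transaction. Both rules are simplifying assumptions of this sketch, and the function names are hypothetical; the actual protocol is given in [THOM79].

```python
def site_vote(read_set, site_timestamps):
    """Vote yes if every item still has the timestamp the transaction saw.

    read_set: {item: timestamp recorded during the read phase}
    site_timestamps: {item: current timestamp of the item at this site}
    """
    return all(site_timestamps.get(item) == ts for item, ts in read_set.items())


def certify(read_set, sites):
    """Certify the transaction if a strict majority of sites vote yes."""
    yes_votes = sum(site_vote(read_set, ts) for ts in sites)
    return yes_votes > len(sites) / 2
```

Because every vote requires a message exchange, even a read-only transaction pays communication overhead here, which is the cost noted in the text.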

In the Wait-Die algorithm, a unique sequence number is attached to
every transaction. A transaction always locks locally if a local copy
is available; otherwise it locks the closest copy. The locks are
propagated to other copies when the transaction commits. Whenever a
transaction is blocked by another transaction, the algorithm compares the
sequence numbers of the two transactions. If the blocked transaction
has a lower priority sequence number, it waits; otherwise it aborts.
This algorithm checks local lock conflict as soon as possible, but it
checks distributed conflict at transaction end. It has no transaction
deadlock (at the expense of more transaction abortions).
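
The wait-or-abort decision can be written as a single comparison. The sketch below assumes the conventional Wait-Die rule, in which a smaller sequence number marks an older transaction and only an older requester is allowed to wait; the function name is hypothetical.

```python
def wait_die(requester_seq, holder_seq):
    """Decide what a blocked requester does under the conventional rule.

    A smaller sequence number is assumed to mean an older transaction.
    """
    if requester_seq < holder_seq:
        return "wait"   # an older requester waits for a younger lock holder
    return "abort"      # a younger requester aborts ("dies")
```

Since a transaction only ever waits for a younger one, the waits-for relation cannot contain a cycle, which is why no deadlock detection is needed.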

In  the  Basic  Timestamp  method,  a  read  and  a  write  timestamp  are 
attached  to  each  data  item  of  the  database.  Each  transaction  that  reads 
or  updates  the  data  item  updates  its  read  or  write  timestamp.  Conflict 
is  detected  by  comparing  the  timestamp  of  the  transaction  that  reads  or 
writes a data item with the timestamps of the data item, and not by
comparing the timestamps of two transactions as done by the Wait-Die
algorithm. This algorithm is similar to the Wait-Die algorithm because it
also avoids transaction deadlock. Unlike the Wait-Die algorithm, it has
no blocking delay and possibly has more transaction abortions. This
algorithm may have fewer transaction abortions than the Wait-Die
algorithm when most transactions are read-only, because it allows two
transactions (a read-only and a write) to access the same data item
simultaneously.
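
The comparison against data item timestamps can be sketched as follows. The sketch uses the textbook timestamp ordering rule (reject a read that arrives after a later write, reject a write that arrives after a later read or write), which is an assumption about the exact variant simulated here; the names `Item`, `to_read`, and `to_write` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Item:
    read_ts: int = 0   # timestamp of the latest transaction that read the item
    write_ts: int = 0  # timestamp of the latest transaction that wrote the item


def to_read(item, ts):
    """Return True if the read is accepted, False if the reader must abort."""
    if ts < item.write_ts:
        return False
    item.read_ts = max(item.read_ts, ts)
    return True


def to_write(item, ts):
    """Return True if the write is accepted, False if the writer must abort."""
    if ts < item.read_ts or ts < item.write_ts:
        return False
    item.write_ts = ts
    return True
```

A request is either accepted at once or rejected, so no transaction ever waits and no deadlock can form.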

The  Multiple  Version  Timestamp  algorithm  is  a  generalization  of  the 
previous  algorithm.  It  keeps  several  versions  of  each  data  item  in 
order  to  reduce  conflict  between  read-only  transactions  and  update  tran¬ 
sactions.  Thus,  this  method  trades  more  overhead  of  maintaining  multi¬ 
ple  data  versions  for  fewer  transaction  abortions. 
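
The benefit of keeping several versions shows up on the read side: a read can be served from an older version instead of being rejected. The sketch below illustrates only that read rule, assuming nonnegative integer timestamps and an initial version with timestamp 0; it omits the rule for rejecting late writes, and the class name `VersionedItem` is hypothetical.

```python
import bisect


class VersionedItem:
    """Data item that keeps every written version, ordered by write timestamp."""

    def __init__(self, initial):
        self.write_ts = [0]      # sorted write timestamps, one per version
        self.values = [initial]  # values[i] was written at write_ts[i]

    def write(self, ts, value):
        """Add a new version (write rejection is not modeled in this sketch)."""
        i = bisect.bisect_right(self.write_ts, ts)
        self.write_ts.insert(i, ts)
        self.values.insert(i, value)

    def read(self, ts):
        """Return the version with the largest write timestamp not exceeding ts."""
        i = bisect.bisect_right(self.write_ts, ts) - 1
        return self.values[i]
```

A reader with timestamp 15 still finds the version written at timestamp 10 even after a version with timestamp 20 has been added, where a single-version scheme would have to abort it.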

The Dynamic Timestamp algorithm [LIN79, LIN81] is an improved
version of the SDD-1 algorithm; it is unique among all the algorithms
that we will compare for the following reasons. It requires transaction
timestamps but not data item timestamps. It does not avoid transaction
blocking; thus it accepts more transaction blocking in exchange for fewer
transaction abortions. But it uses preanalysis of transactions to reduce
unnecessary transaction blocking. This algorithm may require a lot of
communication overhead when many null write messages are needed [BERN82,
LIN79, LIN81], and its performance may depend on system load [LIN81].
Thus it may perform poorly in some system environments.

The principal characteristics of these algorithms are summarized in
Figure 4.2.


                       B-B  P-P  C-C  B-P  BaT  MvT  DDM  Opm  Maj  Die  Dyn

blocking/abortion      b    b    b    b    a    a    a    a    a    m    b
lock conflict check    s    s    s    s    s    s    x    x    l    x    s
deadlock detection     s    l    s    l              l    l    l
Scheduler              2    2    2    2    t    t    2,c  2,c  c    2    t
Location of Scheduler  d    d    cn   d    d    d    d    d    d    d    d
Data Replication       n    p    p    p    p    p    p    p    v    p    n


b:    transaction blocking is preferred.
a:    transaction abortion is preferred.
m:    both blocking and abortion are used.

s:    conflict or deadlock is checked as soon as possible.
l:    conflict or deadlock is checked as late as possible.
x:    local conflict is checked as soon as possible, but
      distributed conflict is checked at transaction end.
' ':  the item does not apply.

2:    two-phase locking scheduler.
t:    timestamp scheduler.
c:    certifier scheduler.
2,c:  mixed two-phase locking and certifier scheduler.

cn:   centralized.
d:    distributed.

n:    do nothing.
p:    primary copy.
v:    voting.


Figure 4.2 Summary of Concurrency Control Algorithms


4.3  Performance  Evaluation 


4.3.1 Short Transaction Loaded & IO Bound

In this section we compare the performance of distributed
concurrency control algorithms in a system environment in which most
transactions are relatively short and the IO resource is the performance
bottleneck. The comparison of these algorithms is summarized in Figure
4.3. The comparison is based on actual simulation results except for
the Wait-Die, Majority Consensus Timestamp, and Dynamic Timestamp
algorithms. The evaluation of the Wait-Die algorithm is based on its
similarity to the Basic Timestamp algorithm; the evaluation of the Dynamic
Timestamp algorithm is based on the results of [LIN81]; and the
evaluation of the Majority Consensus Timestamp algorithm is based on its
similarity to the Basic Optimistic algorithm.


Figure 4.3 shows that five algorithms perform better than the others:
the Basic Timestamp, Multiple Version Timestamp, DDM, Basic Optimistic,
and Wait-Die algorithms.

In the short transaction loaded and IO bound environment, we found
that transaction abortion is a better strategy than transaction blocking
(i.e., it is better to abort than to wait). The abortion strategy is
used by the Basic Timestamp and Multiple Version Timestamp algorithms,
and to a large degree by the Wait-Die algorithm. We also found that it
is better to delay lock conflict detection than to detect lock conflict
early. Both the DDM and the Basic Optimistic algorithms use the delay
strategy.

Although  the  DDM  algorithm  uses  locking  for  write  transactions,  and 
the  Optimistic  algorithm  uses  locking  for  both  read  and  write  transac¬ 
tions,  blocking  occurs  only  among  local  transactions  that  access  data 
from  the  same  site.  Transactions  running  at  different  sites  never  block 
each other. Write locks are propagated to other sites at transaction
end; conflicts among transactions running at different sites are then
detected and always result in transaction abortions. Therefore the
performance of these two algorithms is closer to that of the timestamp
algorithms than to that of the two-phase locking algorithms. However,
notice that the
DDM  and  Basic  Optimistic  algorithms  always  abort  transactions  at  tran¬ 
saction  end,  while  the  timestamp  algorithms  may  abort  transactions  at  an 
earlier  phase  of  their  execution. 

These  five  algorithms  perform  equally  well  in  most  cases.  The 
timestamp  algorithms  perform  better  than  the  DDM  and  Basic  Optimistic 
algorithms  when  the  database  is  fully  redundant  (thus  read-only  transac¬ 
tions  complete  quickly),  the  R/W  ratio  is  high  (probability  of  conflict 
among  data  requests  is  small),  and  local  delay  is  large  (local  blocking 
delay is large and abortion at transaction end is expensive). However,
when the database is less redundant, the DDM and Basic Optimistic
algorithms perform slightly better than the timestamp algorithms. In that
case both read-only and write transactions require some remote data
accesses and take longer to complete, which raises the probability of
conflict among transactions and causes the timestamp algorithms to abort
more transactions.

The Basic Timestamp algorithm performs as well as the Multiple
Version Timestamp algorithm, and the latter requires more overhead and
storage space for keeping multiple versions of data [LINN83]. Therefore
the Basic Timestamp algorithm is preferable to the Multiple Version
Timestamp algorithm, unless the multiple versions of data are required
in any case for database recovery and resiliency. Similarly, the
difference in performance between the DDM and Basic Optimistic
algorithms is very small, and the former needs higher overhead and more
storage space for keeping multiple versions of data. The Basic
Optimistic algorithm is preferable, unless the versions of data are
required in any case for database recovery and resiliency.

The  Wait-Die  algorithm  performs  slightly  worse  than  the  Basic 
Timestamp  algorithm  when  most  transactions  are  read-only.  When  a  read¬ 
only  transaction  conflicts  with  a  write  transaction,  the  timestamp  algo¬ 
rithms  never  abort  the  read-only  transaction,  and  they  abort  the  write 
transaction only when a nonserializable execution may occur. However,
when  most  transactions  are  write  transactions,  the  Wait-Die  algorithm  is 
preferred  because  it  performs  as  well  as  the  Basic  Timestamp  method  and 
it  needs  no  data  item  timestamps,  which  require  storage  space  and  pro¬ 
cessing  overhead. 

The Dynamic Timestamp algorithm performs best when most
transactions are read-only, communication is fast, the database is almost
fully redundant, and preanalysis can be done on most transactions. In
this environment, the fast protocols R1, R1a, R1ab, and R1b [LIN79, LIN82]
apply to most transactions. Assuming system conditions remain the same
except that the database is not redundant, the Dynamic Timestamp
algorithm still performs relatively well, because relatively efficient
protocols (R2, R2a, R2ab, and R2b) apply to most transactions. These
protocols are not as efficient as the group of R1 protocols, but they are
relatively fast compared with the R3 protocol. In all other cases,
either when the communication is slow or when most transactions update
the database, the Dynamic Timestamp algorithm is not efficient.

The Majority Consensus algorithm performs reasonably well, but not
as well as the Basic Optimistic algorithm. The Majority Consensus
algorithm as proposed in [THOM79] requires extra communication overhead
for read-only transactions. If the algorithm is modified to favor
read-only transactions, so that read-only transactions need not be
certified, then it would perform as well as the Basic Optimistic
algorithm.

To summarize, in this environment transaction abortion is a better
strategy than transaction blocking, and delayed lock conflict checking
is a better strategy than early lock conflict checking.


Figure 4.3 Performance Comparison: Short Transaction Loaded & IO Bound


4.3.2 Short Transactions & Communication Bound

In this section we compare the performance of distributed
concurrency control algorithms in a system environment in which most
transactions are relatively short and the communication channel is the
performance bottleneck. The comparison of the algorithms is summarized
in Figure 4.4. The comparison is based on actual simulation results
except for the Wait-Die, Majority Consensus, and Dynamic Timestamp
algorithms. The evaluation of the Wait-Die algorithm is based on its
similarity to the Basic Timestamp algorithm; the evaluation of the Dynamic
Timestamp algorithm is based on the results of [LIN81]; and the
evaluation of the Majority Consensus algorithm is based on its similarity
to the Basic Optimistic algorithm.

Figure  4.4  shows  that  seven  algorithms  perform  better  than  the  oth¬ 
ers:  Basic-Primary  Copy,  Basic  Timestamp,  Multiple  Version  Timestamp, 
DDM, Basic Optimistic, Wait-Die, and Dynamic Timestamp.

We found that, as in the short transaction loaded and IO bound
environment, transaction abortion is a better strategy than transaction
blocking, and that delayed lock conflict detection is a better strategy
than early detection. However, because of the communication channel
bottleneck, the performance of the algorithms that require extra
communication messages degrades in some cases.

The Basic Timestamp and Multiple Version Timestamp algorithms
perform best in all cases. However, when the database is fully redundant,
the DDM and Basic Optimistic algorithms perform just as well. In that
case read-only transactions never incur communication delays, and write
transactions incur communication delays only during the commit phase.
Therefore transactions finish fast, blocking delay is shorter, and
abortion at transaction end is less expensive.

The Majority Consensus algorithm, as proposed in [THOM79], does not
perform  well  because  of  the  extra  communication  messages  required  for 
read-only  transactions.  If  the  algorithm  is  modified  to  favor  read-only 
transactions,  so  that  read-only  transactions  need  not  be  certified,  the 
algorithm  would  perform  as  well  as  the  Basic  Optimistic  algorithm. 

The  Wait-Die  algorithm  performs  just  as  well  as  the  timestamp  algo¬ 
rithms  in  most  cases.  However,  when  most  transactions  are  read-only, 
the  Wait-Die  algorithm  unnecessarily  aborts  more  read-only  transactions 
than  the  timestamp  algorithms,  thus  performing  worse  than  the  timestamp 
algorithms. 

The  DDM  algorithm  performs  as  well  as  the  timestamp  algorithms  when 
the  database  is  fully  redundant.  However,  when  the  database  is  less 
redundant  and  most  transactions  are  read-only,  its  performance  degrades 
as  shown  in  Figure  4.4.  When  the  database  is  not  fully  redundant, 
read-only  transactions  require  one  extra  communication  message,  which 
causes  a  long  delay  in  a  communication  bound  environment. 

The Basic-Primary Copy algorithm performs 10% to 20% worse than the
best algorithms in all cases, because it incurs extra communication
messages when obtaining locks from the primary copies, and it uses
transaction blocking instead of transaction abortion. The Dynamic
Timestamp algorithm performs best when most transactions are read-only
and can be preanalyzed. In this environment, the most efficient protocols
can be used and communication overhead for null-write messages is
minimized.

Since the Basic Timestamp algorithm performs as well as the
Multiple Version Timestamp algorithm, the former is preferable unless the
multiple versions of data are required in any case for database recovery
and resiliency. Similar observations apply to the DDM and Basic
Optimistic algorithms [LINN83].

Our conclusion is that in this environment abortion is better than
blocking, and that delayed lock conflict checking is better than early
lock conflict checking. However, some algorithms that use these two
strategies may not perform well in some cases because they require extra
communication messages.


Rank 1 is best and Rank 6 is worst.

Rank numbers have no absolute meaning. They only show relative
performance across a row, not a column.

R/W: Ratio of Read-only transactions to Write transactions
L/C: Ratio of Local delay to Communication delay, excluding
     queuing delay
Red: Redundancy of the database


Figure 4.4 Performance Comparison: Short Transaction Loaded
& Communication Bound


4.3.3 Long Transaction Loaded & IO Bound

In this section we compare the performance of distributed
concurrency control algorithms in a system environment in which most
transactions are relatively long and the IO resource is the bottleneck.
The comparison is summarized in Figure 4.5. The comparison is based on
actual simulation results except for the Wait-Die and Majority Consensus
algorithms. The evaluation of the Wait-Die algorithm is based on its
similarity to the Basic Timestamp algorithm; and the evaluation of the
Majority Consensus algorithm is based on its similarity to the Basic
Optimistic algorithm.

Figure  4.5  shows  that  three  algorithms  perform  better  than  the  oth¬ 
ers:  Basic  Primary,  DDM,  and  Basic-Optimistic. 

In this environment (long transactions, heavy system load), transactions
conflict with each other more often, but only a fraction of the
conflicts  lead  to  transaction  deadlocks.  Thus,  transaction  blocking  is 
better  than  indiscriminate  transaction  abortion.  Moreover,  prompt  lock 
conflict  detection  is  better  than  procrastination.  Lock  conflicts  that 
are  detected  at  transaction  end  always  lead  to  deadlocks.  The  Basic 
Primary,  DDM,  and  Basic  Optimistic  algorithms  use  the  blocking  strategy. 
The  Basic  Primary  algorithm  uses  the  early  lock  conflict  detection  stra¬ 
tegy. 

The  Basic  Primary  Copy  algorithm  performs  best  in  this  environment 
because  it  does  not  abort  a  transaction  unless  it  deadlocks,  and  it 
detects  lock  conflicts  as  soon  as  they  occur.  However,  when  most  tran¬ 
sactions  are  read-only,  and  the  database  is  not  fully  redundant,  the 
Basic Primary Copy algorithm does not perform as well as the DDM and
Basic-Optimistic algorithms, because its savings in transaction abortions
do not make up for the extra communication messages that it requires for
write locks and deadlock detection.

The  DDM  and  the  Basic  Optimistic  algorithms  perform  well  in  par¬ 
tially  redundant  databases,  because  more  lock  conflicts  are  detected 
during the reading phase of transactions and fewer transactions abort at
the commit phase. However, when the database is fully redundant, most
lock conflicts are detected during the commit phase, which always leads
to deadlocks and transaction abortions, thus resulting in the poorer
performance of these two algorithms under these conditions.

The  timestamp  algorithms  do  not  perform  as  well  as  the  Basic- 
Primary  method  because  transaction  blocking  is  better  than  transaction 
abortion.  However,  the  timestamp  algorithms  perform  better  than  the  DDM 
and Basic-Optimistic algorithms when the database is fully redundant.
In a fully redundant database, read-only transactions incur no
communication delay and complete quickly; the read phase of write
transactions also completes quickly. Thus conflicts between read-only
transactions and write transactions that result in the abortion of write
transactions are reduced. In addition, when the database is fully
redundant, the timestamp algorithms detect more conflicts at the read
phase, thus aborting more transactions at earlier stages of processing,
while the DDM and Basic Optimistic algorithms detect more conflicts at
the commit phase, thus aborting more transactions at their ends.
However, when the database is not fully redundant, the DDM and Basic
Optimistic algorithms detect more conflicts at the read phase and abort
more transactions at the early stages of processing, thus performing
better than the timestamp algorithms.

The  Wait-Die  algorithm  performs  as  well  as  the  Basic  Timestamp 
algorithm,  except  when  most  transactions  are  read-only.  Then  the  Basic 
Timestamp  algorithm  has  higher  throughput  of  read-only  transactions  than 
the  Wait-Die  algorithm. 

The Majority Consensus algorithm also performs poorly because it
delays lock conflict detection until transaction end, thus resulting in
many late transaction abortions. In fact, all certifier algorithms that
certify transactions at transaction end perform badly in the long
transaction environment. The Primary Site & Primary Site (C-C) and the
Primary Copy & Primary Copy (P-P) algorithms also perform relatively
well when the database is fully redundant. These two algorithms abort
fewer transactions than the Basic Timestamp, Multiple Version Timestamp,
DDM, and Basic Optimistic algorithms, and the savings in transaction
abortions more than make up for the extra communication messages
required by the two algorithms. The Basic-Basic algorithm does not
perform as well because it requires many more communication messages
than the other algorithms.

To  summarize,  in  this  environment  transaction  blocking  is  better 
than  transaction  abortion,  and  early  lock  conflict  detection  is  better 
than  late  detection. 

Rank 1 is best and Rank 6 is worst.

Rank numbers have no absolute meaning. They only show relative
performance across a row, not a column.

R/W: Ratio of Read-only transactions to Write transactions
L/C: Ratio of Local delay to Communication delay, excluding
     queuing delay
Red: Redundancy of the database
*  : Does not matter


Figure 4.5 Performance Comparison: Long Transaction Loaded & IO Bound


4.3.4 Long Transactions & Communication Bound

In this section, we compare the performance of distributed
concurrency control algorithms in a system environment in which most
transactions are long and the communication channel is the bottleneck.
The comparison of these algorithms is summarized in Figure 4.6. The
comparison is based on actual simulation results except for the Wait-Die
and Majority Consensus algorithms. The evaluation of the Wait-Die
algorithm is based on its similarity to the Basic Timestamp algorithm,
and the evaluation of the Majority Consensus algorithm is based on its
similarity to the Basic Optimistic algorithm.

Figure 4.6 shows that six algorithms perform better than the others:
Basic Timestamp, Multiple Version Timestamp, DDM, Basic Optimistic,
Basic Primary, and Wait-Die.


In  this  system  environment  (long  transactions,  heavy  system  load, 
and  long  communication  delay)  transactions  conflict  with  each  other  more 
often,  but  only  a  fraction  of  the  conflicts  lead  to  deadlocks;  thus, 
transaction  blocking  is  better  than  indiscriminate  transaction  abortion. 
Moreover,  early  lock  conflict  detection  is  better  than  procrastination. 
Lock  conflicts  detected  at  transaction  end  always  lead  to  deadlocks. 
The Basic Primary, DDM, Basic Optimistic, and to a certain degree the
Wait-Die algorithms use the blocking strategy, and the Basic Primary and
Wait-Die algorithms detect lock conflicts as early as possible. In
addition, because of the long communication delay, algorithms requiring
extra communication messages may not perform well even if they use
transaction blocking instead of transaction abortion. The DDM and the
Basic Primary algorithms require extra communication messages in some
cases.
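
The observation that only a fraction of conflicts lead to deadlocks can
be illustrated with a waits-for graph: a blocked transaction is
deadlocked only if it lies on a cycle of waiting transactions. The
following Python sketch is illustrative only; the graph representation
and the name has_deadlock are assumptions made for this example and are
not taken from the simulation study.

```python
from typing import Dict, Set


def has_deadlock(waits_for: Dict[str, Set[str]]) -> bool:
    """Return True if the waits-for graph contains a cycle.

    waits_for maps a blocked transaction to the set of transactions
    whose locks it is waiting for (a hypothetical representation).
    """
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on current path, finished
    color = {txn: WHITE for txn in waits_for}
    for holders in waits_for.values():
        for txn in holders:
            color.setdefault(txn, WHITE)

    def visit(txn: str) -> bool:
        color[txn] = GREY
        for holder in waits_for.get(txn, ()):
            if color[holder] == GREY:
                return True  # back edge: a cycle of waiting transactions
            if color[holder] == WHITE and visit(holder):
                return True
        color[txn] = BLACK
        return False

    return any(color[txn] == WHITE and visit(txn) for txn in list(color))


if __name__ == "__main__":
    # T1 waits for T2: a conflict, but T2 can finish, so blocking suffices.
    print(has_deadlock({"T1": {"T2"}, "T2": set()}))
    # T1 and T2 wait for each other: a deadlock, one of them must abort.
    print(has_deadlock({"T1": {"T2"}, "T2": {"T1"}}))
```

In the first example a blocking algorithm lets T1 proceed once T2
finishes, whereas an algorithm that aborts on every conflict would
restart a transaction unnecessarily.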

The Basic Primary Copy algorithm performs best when the database is
not fully redundant because it requires no more communication messages
than the other algorithms, and because it causes fewer unnecessary
transaction abortions. Even when the database is fully redundant, if
most transactions are write transactions and local delay is high
relative to the communication delay, the Basic Primary Copy algorithm
still performs better than the Basic Timestamp, Multiple Version
Timestamp, DDM, and Basic Optimistic algorithms, because the latter
abort write transactions frequently. However, when the database is
fully redundant, the Basic Primary Copy algorithm requires more
communication messages than the Basic Timestamp, Multiple Version
Timestamp, DDM, and Basic Optimistic algorithms. Thus, except for the
cases above, the extra communication messages required by the Basic
Primary Copy algorithm make its performance worse than that of the Basic
Timestamp, Multiple Version Timestamp, DDM, and Basic Optimistic
algorithms in this communication bound environment.

The timestamp based algorithms perform best when the database is
fully redundant; then read-only transactions incur no communication
delay and complete quickly. The read phase of write transactions also
completes quickly. When read-only transactions and the read phase of
write transactions complete quickly, conflicts between read-only and
write transactions that result in abortion of the write transactions are
reduced. Thus, unnecessary transaction abortion is reduced.

The DDM method avoids conflicts between read-only transactions and
write transactions, but it pays with more abortions of write
transactions at transaction end. Thus, when most transactions are
read-only, it performs very well: the higher throughput of read-only
transactions makes up for the extra abortions of write transactions.
Notice that DDM requires an extra round of communication messages for
read-only transactions when the database is not fully redundant; its
performance then degrades.

The Basic Optimistic algorithm also performs well when most
transactions are read-only; then read-only transactions and the read
phase of write transactions complete quickly. Otherwise it performs
poorly because the system is eventually saturated with many long write
transactions that later abort.

The Wait-Die algorithm performs as well as the Basic Timestamp
algorithm when most transactions are write transactions, but not as well
when most transactions are read-only transactions. Since the Wait-Die
algorithm needs no overhead for maintaining data item timestamps, it is
preferable to the timestamp based algorithms if most transactions are
write transactions.
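
Wait-Die can do without data item timestamps because its decision
depends only on the timestamps of the two conflicting transactions. The
Python sketch below states the rule as it is commonly given for the
Wait-Die scheme (see [ROSE78]): an older requester waits for a younger
lock holder, while a younger requester is aborted and later restarts
with its original timestamp. The names Action and wait_die are
hypothetical, and the handbook's own definition of the algorithm lies
outside this section.

```python
from enum import Enum


class Action(Enum):
    WAIT = "wait"  # requester blocks until the holder releases the lock
    DIE = "die"    # requester aborts and restarts with its old timestamp


def wait_die(requester_ts: int, holder_ts: int) -> Action:
    """Decide what a requesting transaction does on a lock conflict.

    A smaller timestamp means an older transaction. Timestamps are
    assumed to be unique per transaction (an assumption of this sketch).
    """
    if requester_ts < holder_ts:
        return Action.WAIT  # older transactions are allowed to wait
    return Action.DIE       # younger transactions never wait for older ones


if __name__ == "__main__":
    print(wait_die(requester_ts=1, holder_ts=2))  # older requester
    print(wait_die(requester_ts=2, holder_ts=1))  # younger requester
```

Because a transaction only ever waits for a younger one, no cycle of
waiting transactions can form, so no separate deadlock detection is
needed in this sketch.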

The Basic & Basic, Primary Copy & Primary Copy, and Primary Site &
Primary Site algorithms perform poorly because they require more
communication messages than other algorithms. Communication overhead is
expensive in this communication bound environment.

To summarize, in this environment transaction blocking is better
than transaction abortion, and early lock conflict detection is better
than late detection. However, some algorithms that use these two
strategies may not perform well in some cases because they require extra
communication messages.


Rank numbers have no absolute meaning. They only show relative
performance across a row, not a column.

R/W: Ratio of Read-only transactions to Write transactions
L/C: Ratio of Local delay to Communication delay, excluding
     queuing delay
Red: Redundancy of the database
*  : Does not matter


Figure 4.6 Performance Comparison: Long Transactions & Communication
Bound


4.4 Conclusion

We found that five of the twelve algorithms perform best in various
system environments: Basic Timestamp, Multiple Version Timestamp, DDM,
Basic Optimistic, and Basic Primary.

When most transactions are short, concurrency control algorithms
that abort conflicting transactions (such as the Basic Timestamp and
Multiple Version Timestamp algorithms) perform better than algorithms
that block conflicting transactions (such as the Basic Primary
algorithm). In this environment, transactions conflict rarely; and when
they do conflict, the blocking transactions tend to be longer than the
average transaction, so the blocking delay is long [LINN83]. If a
two-phase locking algorithm must be used, algorithms that delay lock
conflict checking (such as the DDM and the Basic Optimistic algorithms)
perform better than those that expedite lock conflict checking (such as
the Basic Primary algorithm). Unless the communication bandwidth is very
high, communication delay can devastate system performance; thus, the
designer should reduce communication delay by locally controlling and
accessing data as much as possible.


The issue of balancing communication delay against data distribution
and replication is part of the complex problem of distributed database
design. Distributed database design must also take into account the
issues of distributed query processing and distributed database
reliability, and is beyond the scope of this handbook.

The behavior of systems that have long transactions is very different
from that of systems that have short transactions. Long transactions
degrade system performance very quickly because they have more
transaction conflicts. Since only a fraction of these conflicts results
in deadlocks, concurrency control algorithms that use transaction
blocking often perform better than those that use transaction abortion
indiscriminately. Moreover, concurrency control algorithms that detect
transaction conflicts earlier often perform better than those that
detect them later. The effect of communication delay on the performance
of a system that has long transactions is even more devastating than the
effect on a system that has short transactions. Thus the designer must
reduce communication delay as much as possible by controlling and
accessing data locally.

However, no matter which concurrency control algorithm the designer
uses, a system that has long transactions always performs worse than a
system that has short transactions. The designer should design
transactions to access as much data in parallel as possible, and should
break long transactions into shorter transactions. Long transactions
that cannot be broken into shorter ones must be executed in background
mode.

Our performance study shows that no one algorithm performs best in
all system and application environments. If the system environment is
stable, the database designer can select the one algorithm that performs
best in that environment. If the system environment is not stable, the
database designer can assign different weights to different environments
according to how often the system stays in each environment. The
database designer then selects the algorithm that has the best weighted
average performance.
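
The weighted selection described above can be sketched in a few lines of
Python. Because the rank numbers in the figures only compare algorithms
within one environment, the sketch assumes the designer has an absolute
performance measure (for example, throughput) for each algorithm in each
environment. The function name, the environment labels, and all numbers
below are made up for illustration and are not simulation results.

```python
from typing import Dict, Tuple


def best_weighted_algorithm(
    performance: Dict[str, Dict[str, float]],
    weights: Dict[str, float],
) -> Tuple[str, Dict[str, float]]:
    """Pick the algorithm with the best weighted average performance.

    performance[algorithm][environment] is a higher-is-better measure;
    weights[environment] reflects how often the system is in that
    environment. Weights need not sum to one.
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("environment weights must sum to a positive value")
    scores = {
        algorithm: sum(weights[env] * per_env[env] for env in weights) / total
        for algorithm, per_env in performance.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Made-up throughput figures (higher is better), for illustration only.
EXAMPLE_PERFORMANCE = {
    "Basic Timestamp": {"short": 100.0, "long": 40.0},
    "Basic Primary": {"short": 80.0, "long": 70.0},
}

if __name__ == "__main__":
    # A system that mostly runs short transactions.
    print(best_weighted_algorithm(EXAMPLE_PERFORMANCE,
                                  {"short": 0.75, "long": 0.25}))
    # A system that mostly runs long transactions.
    print(best_weighted_algorithm(EXAMPLE_PERFORMANCE,
                                  {"short": 0.25, "long": 0.75}))
```

With these invented figures the choice flips as the weights shift from
short to long transactions, which is the situation the weighting is
meant to resolve.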

From the results, we can also conclude that the best algorithm
would be one that could be adjusted by the system administrator
according to the environment. The administrator would adjust the
algorithm to use transaction abortion and delay lock conflict detection
whenever transactions are short, and to use transaction blocking and
detect lock conflicts as soon as possible whenever transactions are
long. The adjustable algorithm would also alternate, depending on the
load on the communication channel, between algorithms that have more
localized control and algorithms that have more distributed control.
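
A minimal Python sketch of such an adjustable policy follows. The
thresholds, parameter names, and the Policy type are hypothetical. The
mapping of a heavily loaded channel to more localized control is an
assumption inferred from the earlier advice to control and access data
locally when communication delay is high; the text does not spell it
out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    on_conflict: str     # "abort" or "block"
    conflict_check: str  # "delayed" or "early"
    control: str         # "localized" or "distributed"


def choose_policy(
    avg_txn_ops: float,
    channel_utilization: float,
    long_txn_ops: float = 20.0,  # hypothetical cut-off for "long"
    busy_channel: float = 0.7,   # hypothetical cut-off for "loaded"
) -> Policy:
    """Select concurrency control strategies for the current environment."""
    long_txns = avg_txn_ops >= long_txn_ops
    busy = channel_utilization >= busy_channel
    return Policy(
        # Short transactions: abort and delay detection; long: block early.
        on_conflict="block" if long_txns else "abort",
        conflict_check="early" if long_txns else "delayed",
        # Assumed direction: a loaded channel favors localized control.
        control="localized" if busy else "distributed",
    )


if __name__ == "__main__":
    print(choose_policy(avg_txn_ops=5, channel_utilization=0.2))
    print(choose_policy(avg_txn_ops=50, channel_utilization=0.9))
```

In practice the administrator would tune the two cut-offs from measured
workload statistics rather than use the placeholder values shown here.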


5. References

[AHO75] A.V. Aho, J.E. Hopcroft and J.D. Ullman, 'The Design and Analysis
of Computer Algorithms,' Addison-Wesley Publishing Co. (1975).

[ALSB76] Alsberg, P.A. and J.D. Day, 'A Principle for Resilient Sharing
of Distributed Resources,' Proc. 2nd Int. Conf. on Software
Engineering, October 1976.

[ALSB78] Alsberg, P.A., G.G. Belford, J.D. Day and E. Grapa, 'Multi-copy
Resiliency Techniques,' Distributed Data Management (J.B. Rothnie,
P.A. Bernstein, D.W. Shipman, eds.), IEEE, 1978, pp. 128-176.

[ANDL82] Andler, S., I. Ding, K. Eswaran, C. Hauser, W. Kim, J. Mehl, R.
Williams, 'System D: A Distributed System for Availability,'
Proc. 8th VLDB, Sept. 1982, pp. 33-44.

[ATTA82] Attar, R., P.A. Bernstein and N. Goodman, 'Site Initialization,
Recovery, and Back-up in a Distributed Database System,' Proc.
6th Berkeley Workshop, Feb. 1982, pp. 185-202.

[BADA79] Badal, D.Z., 'Correctness of Concurrency Control and
Implications in Distributed Databases,' Proc. COMPSAC 79 Conf.,
Chicago, Nov. 1979.

[BART82] Bartlett, J.F., 'A 'NonStop' Operating System,' in The Theory
and Practice of Reliable System Design (Siewiorek and Swarz,
eds.), Digital Press, 1982, pp. 453-460.

[BAYE80a] Bayer, R., H. Heller and A. Reiser, 'Parallelism and Recovery
in Database Systems,' ACM Trans. on Database Systems, Vol. 5, No.
2 (June 1980), pp. 139-156.

[BAYE80b] Bayer, R., K. Elhardt, H. Heller and A. Reiser, 'Distributed
Concurrency Control in Database Systems,' Proc. Sixth Int. Conf.
on Very Large Data Bases, IEEE, N.Y., 1980, pp. 275-284.

[BJOR72] Bjork, L.A. and C.T. Davies, 'The semantics of the preservation
and recovery of integrity in a data system,' IBM TR-02.540, Dec.
22, 1972.

[BERN78] Bernstein, P.A., J.B. Rothnie, N. Goodman and C.H.
Papadimitriou, 'The Concurrency Control Mechanism of SDD-1: A System
for Distributed Databases (The Fully Redundant Case),' IEEE Trans. on
Software Engineering, Vol. SE-4, No. 3 (May 1978).

[BERN79] Bernstein, P.A., D. Shipman and W.S. Wong, 'Formal Aspects of
Serializability in Database Concurrency Control,' IEEE Trans. on
Software Engineering, Vol. SE-5, No. 3, May 1979.

[BERN80a] Bernstein, P.A. and D. Shipman, 'The Correctness of
Concurrency Mechanisms in a System for Distributed Databases
(SDD-1),' ACM Trans. on Database Systems, Vol. 5, No. 1, March 1980.

[BERN80b] Bernstein, P.A., D.W. Shipman and J.B. Rothnie, 'Concurrency
Control in a System for Distributed Databases (SDD-1),' ACM
Trans. on Database Sys. 5, 1 (March 1980), pp. 18-51.

[BERN81a] Bernstein, P.A. and N. Goodman, 'Concurrency Control in
Distributed Database Systems,' ACM Computing Surveys, 13, 2 (June
1981), pp. 185-221.

[BERN81b] Bernstein, P.A., N. Goodman and M.Y. Lai, 'A Two-Part Proof
Schema for Database Concurrency Control,' Proc. 1981 Berkeley
Workshop on Distributed Databases and Computer Networks.

[BERN82] Bernstein, P.A. and N. Goodman, 'A Sophisticate's Introduction
to Distributed Database Concurrency Control,' Proc. 8th VLDB,
Sept. 1982, pp. 62-76.

[BERN83] Bernstein, P.A. and N. Goodman, 'Concurrency Control Algorithms
of Multiversion Database Systems,' submitted for publication.

[BJOR73] Bjork, L.A., 'Recovery Scenario for a DB/DC System,' Proc. ACM
Nat'l Conf., 1973, pp. 142-146.

[CASA79] Casanova, M.A., The Concurrency Control Problem of Database
Systems, Lecture Notes in Computer Science, Vol. 116,
Springer-Verlag, 1981 (originally published as TR-17-79, Center of
Research in Computing Technology, Harvard University, 1979).

[CHAN82a] Chan, A., U. Dayal, S. Fox, N. Goodman, D. Ries and D. Skeen,
'Overview of an Ada Compatible Distributed Database Manager,'
submitted for publication.

[CHAN82b] Chan, A. and R. Gray, 'Implementing Distributed Read-only
Transactions,' submitted for publication.

[CHAN82] Chan, A., S. Fox, W.T. Lin, A. Nori, and D. Ries, 'The
Implementation of an Integrated Concurrency Control and Recovery
Scheme,' Proc. ACM SIGMOD Conf. on Management of Data, June 1982,
pp. 184-191.


[CHEN80] Cheng, W.K. and G.G. Belford, 'Update Synchronization in
Distributed Databases,' Proc. 2nd Berkeley Workshop on Very Large
Data Bases, Oct. 1980.

[CHEN82] Cheng, W.K. and G.G. Belford, 'The Resiliency of Fully
Replicated Distributed Databases,' Proc. 6th Berkeley Workshop, Feb.
1982, pp. 23-44.

[COOP82] Cooper, E.C., 'Analysis of Distributed Commit Protocols,' Proc.
ACM SIGMOD Conf. on Management of Data, ACM, June 1982, pp.
175-183.

[DAVI73] Davies, C.T., 'Recovery Semantics for a DB/DC System,' Proc. ACM
Nat'l Conf., 1973, pp. 136-141.

[DOLE82] Dolev, D., 'The Byzantine Generals Strike Again,' J. of
Algorithms, 3, 1 (1982).

[DUBO82] Dubourdieu, D.J., 'Implementation of Distributed Transactions,'
Proc. Sixth Berkeley Workshop on Distributed Data Management and
Computer Networks, 1982, pp. 81-94.

[EAGE81] Eager, D.L., 'Robust Concurrency Control in a Distributed
Database,' Univ. of Toronto TR CSRG #135, Oct. 1981.

[ESWA76] Eswaran, K.P., J.N. Gray, R.A. Lorie and I.L. Traiger, 'The
Notions of Consistency and Predicate Locks in a Database System,'
Commun. ACM, Vol. 19, No. 11, Nov. 1976, pp. 624-633.

[ELLI77] Ellis, C.A., 'A Robust Algorithm for Updating Duplicate
Databases,' Proc. 2nd Berkeley Workshop on Distributed Databases and
Computer Networks, May 1977.

[FISC82] Fischer, M.J. and A. Michael, 'Sacrificing Serializability to
Attain High Availability of Data in an Unreliable Network,' Proc.
1st ACM SIGACT-SIGMOD Symp. on Principles of Database Systems,
ACM, Mar. 1982, pp. 70-75.

[GALL82] Galler, B.I., Ph.D. Thesis, Univ. of Toronto, 1982.

[GARC78] Garcia-Molina, H., 'Performance Comparisons of Two Update
Algorithms for Distributed Databases,' Proc. 3rd Berkeley Workshop on
Distributed Databases and Computer Networks, August 1978.

[GARC79a] Garcia-Molina, H., 'Performance of Update Algorithms for
Replicated Data in a Distributed Database,' Ph.D. Dissertation,
Computer Science Department, Stanford University, June 1979.


[GARC79b] Garcia-Molina, H., 'A Concurrency Control Mechanism for
Distributed Data Bases Which Use Centralized Locking Controllers,'
Proc. 4th Berkeley Workshop on Distributed Data Management &
Computer Networks, August 1979.

[GARC82] Garcia-Molina, H., 'Elections in a Distributed Computing
System,' IEEE Trans. on Computers C-31, 1 (Jan. 1982), pp. 48-59.

[GELE78] Gelenbe, E. and K. Sevcik, 'Analysis of Update Synchronization
for Multiple Copy Data Bases,' Proc. 3rd Berkeley Workshop on
Distributed Databases and Computer Networks, August 1978.

[GIFF79] Gifford, D.K., 'Weighted Voting for Replicated Data,' Proc. 7th
Symp. on Operating Systems Principles, ACM, Dec. 1979, pp.
150-159.

[GLIG80] Gligor, V.D. and S.H. Shattuck, 'On Deadlock Detection in
Distributed Systems,' IEEE Trans. on Software Engineering, Vol.
SE-6, No. 5, September 1980, pp. 435-440.

[GRAY75] Gray, J.N., R.A. Lorie, G.R. Putzolu and I.L. Traiger,
'Granularity of Locks and Degrees of Consistency in a Shared Data
Base,' IBM Research Report RJ1654, September 1975.

[GRAY78] Gray, J.N., 'Notes on Database Operating Systems,' Operating
Systems: An Advanced Course, Vol. 60, Lecture Notes in Computer
Science, Springer-Verlag, N.Y., 1978, pp. 393-481.

[GRAY81] Gray, J.N., P. McJones, M. Blasgen, B. Lindsay, R. Lorie, T.
Price, F. Putzolu and I. Traiger, 'The Recovery Manager of the
System R Database Manager,' ACM Computing Surveys, 13, 2 (June
1981), pp. 223-242.

[HAMM80] Hammer, M.M. and D.W. Shipman, 'Reliability Mechanisms for
SDD-1: A System for Distributed Databases,' ACM Trans. on Database
Sys., Vol. 5, No. 5 (Dec. 1980), pp. 431-466.

[HARD79] Harder, T. and A. Reuter, 'Optimization of logging and recovery
in a database system,' in Database Architecture, Bracchi and
Nijssen, eds., North-Holland, 1979, pp. 151-168.

[HARD82] Harder, T. and A. Reuter, 'Principles of Transaction Oriented
Database Recovery - A Taxonomy,' Univ. Kaiserslautern TR 50/82.

[HOLT72] Holt, R.C., 'Some Deadlock Properties of Computer Systems,'
Computing Surveys 4, 3 (Dec. 1972), pp. 179-195.


[KANE79] Kaneko, A., Y. Nishihara, K. Tsuruoka, and M. Hattori, 'Logical
Clock Synchronization Method for Duplicated Database Control,'
Proc. First Int'l. Conf. on Distributed Computing Systems, IEEE,
N.Y., October 1979, pp. 601-611.

[KAWA79] Kawazu, S., S. Minami, K. Itoh, and K. Teranaka, 'Two-Phase
Deadlock Detection Algorithm in Distributed Databases,' Proc.
1979 Int'l. Conf. on Very Large Data Bases, IEEE, N.Y.

[KING74] King, P.F. and A.J. Collmeyer, 'Database Sharing - An Efficient
Method for Supporting Concurrent Processes,' Proc. 1974 NCC,
AFIPS Press, Montvale, NJ, 1974.

[KIM79] Kim, K.H., 'Error Detection, Reconfiguration and Recovery in
Distributed Processing Systems,' Conf. on Dist'd Computing, IEEE,
1979, pp. 284-294.

[KUNG79] Kung, H.T. and J.T. Robinson, 'On Optimistic Methods for
Concurrency Control,' Proc. 1979 Conf. on Very Large Data Bases, Rio
de Janeiro, Brazil, October 1979.

[LAMP76] Lampson, B.W. and H. Sturgis, 'Crash Recovery in a Distributed
Storage System,' Technical Report, Xerox PARC (1976).

[LAMP78a] Lamport, L., 'The Implementation of Reliable Distributed
Multiprocess Systems,' Computer Networks, 2 (1978), pp. 95-114.

[LAMP78b] Lamport, L., 'Time, Clocks and the Ordering of Events in a
Distributed System,' Comm. of the ACM 21, 7 (July 1978), pp.
558-565.

[LAMP82] Lamport, L., R. Shostak and M. Pease, 'The Byzantine Generals
Problem,' ACM Trans. on Programming Languages and Systems, Vol.
4, No. 3 (July 1982), pp. 382-401.

[LELA78] LeLann, G., 'Algorithms for Distributed Data-Sharing Systems
Which Use Tickets,' Proc. 3rd Berkeley Workshop on Distributed
Databases and Computer Networks, August 1978.

[LIND79] Lindsay, B.G. et al., 'Notes on Distributed Databases,' IBM
Research Report, No. RJ2571, July 1979.

[LIN79] Lin, W.K., 'Concurrency Control in a Multiple Copy Distributed
Data Base System,' Proc. 4th Berkeley Workshop on Distributed
Data Management & Computer Networks, August 1979.


[LIN81] Lin, W.K., 'Performance Evaluation of Two Concurrency Control
Mechanisms in a Distributed Database System,' ACM SIGMOD-81
International Conference on Management of Data, April 1981, Ann
Arbor, MI.

[LIN81a] Lin, W.K., et al., 'Distributed Database Control and Allocation
First Semiannual Technical Report,' July 8, 1981, Computer Corp.
of America, Cambridge, MA.

[LIN82a] Lin, W.K., et al., 'Distributed Database Control and Allocation
Second Semiannual Technical Report,' Jan. 8, 1982, Computer Corp.
of America, Cambridge, MA.

[LIN82b] Lin, W.K., et al., 'Distributed Database Control and Allocation
Third Semiannual Technical Report,' July 8, 1982, Computer Corp.
of America, Cambridge, MA.

[LIN83] Lin, W.K., et al., 'Distributed Database Control and Allocation
Final Technical Report,' Feb. 8, 1983, Computer Corp. of America,
Cambridge, MA.

[LINN82a] Lin, W.K. and J. Nolte, 'Performance of Two Phase Locking,'
Proc. 1982 Berkeley Workshop on Distributed Data Management &
Computer Networks, pp. 131-160.

[LINN82b] Lin, W.K. and J. Nolte, 'Read-Only Transaction and Two Phase
Locking,' 2nd IEEE Symposium on Reliability in Distributed
Software and Database Systems, July 1982, Pittsburgh, PA.

[LINN82c] Lin, W.K. and J. Nolte, 'Communication Delay and Two Phase
Locking,' 3rd International Conference on Distributed Computing
Systems, Oct. 1982, Fort Lauderdale, FL.

[LINN83] Lin, W.K. and J. Nolte, 'Basic Timestamp, Multiple Version
Timestamp, and Two Phase Locking,' submitted for publication.

[LORI77] Lorie, R.A., 'Physical Integrity in a Large Segmented Database,'
ACM Trans. on Database Sys., Vol. 2, No. 1 (Mar. 1977), pp.
91-104.

[MINO79] Minoura, T., 'A New Concurrency Control Algorithm for
Distributed Data Base Systems,' Proc. 4th Berkeley Conf. on
Distributed Data Management & Computer Networks, August 1979.

[MENA79] Menasce, D.A. and R.R. Muntz, 'Locking and Deadlock Detection
in Distributed Databases,' IEEE Transactions on Software
Engineering, Vol. SE-5, No. 3, May 1979, pp. 195-202.


[MENA80a]  Menasce,  D.A.,  G.J.  Popek  and  R.R.  Muntz,  'A  Locking  Protocol 
for  Resource  Coordination  in  Distributed  Databases,'  ACM  Trans. 
on  Database  Sys. .  Vol.  5,  No.  2,  (June  1980),  pp.  103-138. 

[MENA80b]  Menasce,  D.A.  and  O.E.  Landes,  'On  the  Design  of  a  Reliable 
Storage  Component  for  Distributed  Database  Management  Systems,' 
Proc.  6th  VLPB.  Oct.  1980,  pp.  365-375. 

[MONT78]  Montgomery,  V.A.,  'Robust  Concurrency  Control  for  a  Distributed 
Information  System, '  Ph.D.  dissertation.  Laboratory  for  Computer 
Science,  MIT,  December  1978. 

[PAPA77]  Papadimitr ion,  C.A.,  Bernstein,  P.A.  and  Rothnie,  J.B. ,  Jr., 
'Some  Computational  Problems  Related  to  Database  Concurrency  Con¬ 
trol,'  Proc.  Conf .  on  Theoretical  Computer  Science.  Waterloo, 
Ontario.  August  1977. 

IPAPA79]  Papadimitriou,  C.A. ,  ' Serial izabil i ty  of  Concurrent  Updates,* 
Journal  of  ACM.  Vol.  26.  No.  4,  Oct.  1979,  pp.  631-653. 

[PARK82]  Parker,  D.S.  and  K.A.  Ramas,  'A  Distributed  File  System  Archi¬ 
tecture  Supporting  High  Availability,'  Proc.  8th  VLPB.  Sept. 
1982,  pp.  161-184. 

(PEAS 80 ]  Pease,  M. ,  R.  Shostak  and  L.  Lamport,  'Reaching  Agreement  in 
the  Presence  of  Faults,'  JACM.  27,  2  (1980),  pp.  228-234. 

[RAPP75] Rappaport, R.L., 'File Structure Design to Facilitate On-Line
Instantaneous Updating,' Proc. of the 1975 SIGMOD Conf., pp. 1-14.

[REED78] Reed, D.P., 'Naming and Synchronization in Decentralized
Computer Systems,' Ph.D. Dissertation, MIT Department of Electrical
Engineering, Sep. 1978.

[REED79] Reed, D.P., 'Implementing Atomic Actions,' Proc. 7th ACM Symp.
on Operating Systems Principles, ACM, Dec. 1979.

[REUT80] Reuter, A., 'A Fast Transaction-Oriented Logging Scheme for Undo
Recovery,' IEEE Trans. on Soft. Eng., SE-6 (July 1980), pp. 348-356.

[RIES79a] Ries, D., 'The Effects of Concurrency Control on the
Performance of a Distributed Data Management System,' Proc. 4th
Berkeley Conf. on Distributed Data Management & Computer Networks,
August 1979, Berkeley, CA, pp. 75-112.




[RIES79b] Ries, D., 'The Effect of Concurrency Control on Database
Management System Performance,' Ph.D. Thesis, Electronics Research
Lab., Univ. of CA, Berkeley, 1979.

[RIES82] Ries, D., A. Chan, U. Dayal, S.A. Fox, W.I. Lin and L. Yedvab,
'Decompilation and Optimization of ADAPLEX: A Procedural Database
Language,' Tech. Rep. CCA-82-04, Computer Corp. of America,
Cambridge, MA (in preparation 1982).

[ROSE78] Rosenkrantz, D.J., R.E. Stearns and P.M. Lewis, 'System Level
Concurrency Control for Distributed Database Systems,' ACM Trans.
on Database Systems, Vol. 3, No. 2, June 1978, pp. 178-198.

[SCHL79] Schlageter, G., 'Enhancement of Concurrency in DBS by the Use
of Special Rollback Methods,' DB Architecture, Bracchi and
Nijssen, eds., North-Holland, 1979, pp. 141-149.

[SHAP77] Shapiro, R.V. and R.E. Millstein, 'Reliability and Fault
Recovery in Distributed Processing,' Oceans '77 Conf. Record,
Vol. II, Los Angeles, CA, 1977.

[SILL80] Silberschatz, A. and Z. Kedem, 'Consistency in Hierarchical
Database Systems,' Journal of the ACM, Vol. 27, No. 1, Jan. 1980,
pp. 72-80.

[SKEE81a] Skeen, D., 'Crash Recovery in a Distributed Database System,'
Ph.D. Thesis, Dept. of Elec. Eng. and Comp. Sci., Univ. of CA,
Berkeley, 1981.

[SKEE81b] Skeen, D. and M. Stonebraker, 'A Formal Model of Crash
Recovery in a Distributed System,' Proc. 5th Berkeley Workshop,
1981, pp. 129-142.

[SKEE82a] Skeen, D., 'Nonblocking Commit Protocols,' Proc. 1982 ACM-
SIGMOD Conf. on Management of Data, ACM, pp. 133-147.

[SKEE82b] Skeen, D., 'A Quorum Based Commit Protocol,' Proc. 6th Berkeley
Workshop, Feb. 1982, pp. 69-80.

[STEA76] Stearns, R.E., P.M. Lewis, II and D.J. Rosenkrantz,
'Concurrency Controls for Database Systems,' Proc. of the 17th
Symposium on Foundations of Computer Science, IEEE, 1976, pp.
19-32.

[STEA81] Stearns, R.E. and D.J. Rosenkrantz, 'Distributed Database
Concurrency Controls Using Before-Values,' Proc. 1981 ACM-SIGMOD
Conf., ACM, N.Y., pp. 74-83.



[STRO81] Strom, B.I., 'Consistency of Redundant Databases in a Weakly
Coupled Distributed Computer Conferencing System,' Proc. 5th
Berkeley Workshop, 1981, pp. 143-153.

[STON79] Stonebraker, M., 'Concurrency Control and Consistency of
Multiple Copies of Data in Distributed INGRES,' IEEE Trans. on Soft.
Eng., Vol. SE-5, No. 3, May 1979, pp. 188-194.

[THOM79] Thomas, R.H., 'A Majority Consensus Approach to Concurrency
Control for Multiple Copy Databases,' ACM Trans. on Database
Systems, Vol. 4, No. 2, June 1979, pp. 180-209.

[TRAI82] Traiger, I.L., J. Gray, C.A. Galtieri and B.G. Lindsay,
'Transactions and Consistency in Distributed Database Systems,' ACM
Trans. on Database Systems, Vol. 7, No. 3 (Sept. 1982), pp.
323-342.

[VERH78] Verhofstad, J.M.S., 'Recovery Techniques for Database Systems,'
ACM Computing Surveys, 10, 2 (1978), pp. 167-196.

[VERH79] Verhofstad, J.M.S., 'Recovery Based on Types,' DB Architecture,
Bracchi and Nijssen, eds., North-Holland, 1979, pp. 125-139.

[WALT82] Walter, B., 'A Robust and Efficient Protocol for Checking the
Availability of Remote Sites,' Proc. 6th Berkeley Workshop, Feb.
1982, pp. 45-68.


Appendix  A 


A. 


Notations used in the appendix are explained here and in the figures.

READ THROUGHPUT: Average number of read-only requests successfully
processed per unit time (excluding requests processed
and subsequently aborted).

WRITE THROUGHPUT: Average number of write requests successfully
processed per unit time (excluding requests processed
and subsequently aborted).

Average Response Per Read Request: Average time required to process
a read-only request.

Average Response Per Write Request: Average time required to process
a write request.
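
The four metrics defined above can be illustrated with a minimal sketch. The `Request` record, its field names, and the choice to average response time over successful requests only are assumptions made for this illustration; the report does not describe how the simulator stores its measurements.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical record of one processed request (not from the report)."""
    kind: str             # "read" (read-only) or "write"
    response_time: float  # time taken to process the request
    aborted: bool         # True if processed and subsequently aborted


def throughput(requests, kind, elapsed):
    """Successfully processed requests of one kind per unit time.

    Requests that were processed and later aborted are excluded,
    as in the READ/WRITE THROUGHPUT definitions.
    """
    done = [r for r in requests if r.kind == kind and not r.aborted]
    return len(done) / elapsed


def average_response(requests, kind):
    """Average processing time per request of one kind.

    Assumption: only successful (non-aborted) requests are averaged.
    """
    done = [r for r in requests if r.kind == kind and not r.aborted]
    if not done:
        return 0.0
    return sum(r.response_time for r in done) / len(done)
```

For example, two successful reads and one aborted read observed over 10 time units give a read throughput of 0.2 requests per unit time.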

Basic  Basic  :  Basic  and  Basic  algorithm. 

Prmry  Prmry  :  Primary  Copy  and  Primary  Copy  algorithm. 

Cntrl  :  Primary  Site  and  Primary  Site  algorithm. 

Basic  Prmry  :  Basic  and  Primary  Copy  algorithm. 

Basic  Cntrl  :  Basic  and  Primary  Site  algorithm. 

Basic  Tstmp  :  Basic  Timestamp  algorithm. 

Mltpl  Versn  :  DDM  Multiple  Version  and  Optimistic  algorithm. 

Basic  Optms  :  Basic  and  Optimistic  algorithm. 
•  Multiprogramming levels at the three sites are 10/11/11.

§  Multiprogramming levels at the three sites are 16/8/8.

#  Multiprogramming levels at the three sites are 24/4/4.

TZ  :  Average no. of requests per transaction (transaction size).
DZ  :  Total number of data items in the database (database size).
MP  :  Multiprogramming level.

R/(R+W)  :  Percentage of transactions that are read-only.

IO/Comm  :  Ratio of local delay to communication delay
(excluding queueing delay).

No. of Copy  :  Fraction of the database residing at sites S1, S2, & S3.


Figure  A.1  READ  THROUGHPUT:  Short  Transaction 
Loaded  &  Communication  Bound 
•  Multiprogramming levels at the three sites are 10/11/11.

§  Multiprogramming levels at the three sites are 16/8/8.

#  Multiprogramming levels at the three sites are 24/4/4.

TZ  :  Average no. of requests per transaction (transaction size).

DZ  :  Total number of data items in the database (database size).

MP  :  Multiprogramming level.

R/(R+W)  :  Percentage of transactions that are read-only.

IO/Comm  :  Ratio of local delay to communication delay
(excluding queueing delay).

No. of Copy  :  Fraction of the database residing at sites S1, S2, & S3.


Figure A.2  WRITE THROUGHPUT: Short Transaction
Loaded & Communication Bound




TZ=4, DZ=8192

[Table columns: MP, R/(R+W), IO/Comm, No. of copy (S1, S2, S3), Basic Basic, Prmry Prmry, Cntrl, Basic Prmry, Basic Cntrl, Basic Tstmp, Mltpl Versn, Basic Optms. The tabulated values are not recoverable from the source.]


•  Multiprogramming levels at the three sites are 10/11/11.

§  Multiprogramming levels at the three sites are 16/8/8.

#  Multiprogramming levels at the three sites are 24/4/4.

TZ  :  Average no. of requests per transaction (transaction size).
DZ  :  Total number of data items in the database (database size).
MP  :  Multiprogramming level.

R/(R+W)  :  Percentage of transactions that are read-only.

IO/Comm  :  Ratio of local delay to communication delay
(excluding queueing delay).

No. of Copy  :  Fraction of the database residing at sites S1, S2, & S3.

Figure A.3  Average Response Per Read Request:
Short Transactions & Communication Bound
TZ=4, DZ=8192


•  Multiprogramming levels at the three sites are 10/11/11.

§  Multiprogramming levels at the three sites are 16/8/8.

#  Multiprogramming levels at the three sites are 24/4/4.

TZ  :  Average no. of requests per transaction (transaction size).

DZ  :  Total number of data items in the database (database size).

MP  :  Multiprogramming level.

R/(R+W)  :  Percentage of transactions that are read-only.

IO/Comm  :  Ratio of local delay to communication delay
(excluding queueing delay).

No. of Copy  :  Fraction of the database residing at sites S1, S2, & S3.

Figure A.4  Average Response Per Write Request: Short
Transactions & Communication Bound
TZ  =  Average number of requests per transaction (transaction size).

DZ  =  Total number of data items in the database (database size).

MP  =  Multiprogramming level.

R/(R+W)  =  Percentage of transactions that are read-only.

IO/Comm  =  local delay/message communication delay/data communication delay
No. of copy  =  Fraction of the database residing at each site.


Figure A.5  Through-Put (Read/Write): Short Transactions
& Communication Bound
•  Multiprogramming levels at the three sites are 10/11/11.

§  Multiprogramming levels at the three sites are 16/8/8.

#  Multiprogramming levels at the three sites are 24/4/4.

Assumptions:

Queueing for local processing is simulated.

Two kinds of local processing delay are simulated:
message processing delay and data processing delay.

The average round trip communication delay is fixed at 1.
The message processing delay is fixed at 5% of the
round trip communication delay.

Ratio of data processing & message processing delay is 10.
The ratio of data processing delay to round trip
communication delay is shown in column 'IO/Com'.

Notation:

TZ  =  Average number of requests per transaction (transaction size).
DZ  =  Total number of data items in the database (database size).

MP  =  Multiprogramming level.

R/W  =  Percentage of transactions that are read-only.

IO/Com  =  Ratio of local data processing delay to communication
delay (excluding queueing).

Database Copies  =  Fraction of the database residing at each site.

Figure A.6  Through-Put (Read/Write): Short
Transactions & IO Bound
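
The delay assumptions listed for Figure A.6 can be turned into concrete numbers with a small sketch. The function name and parameters are hypothetical, and the sketch models only the two statements whose meaning is unambiguous in the text: message processing is 5% of the round trip delay, and data processing is the tabulated IO/Com ratio times the round trip delay. The separate statement about a ratio of 10 is not modelled here.

```python
def local_delays(io_com, round_trip=1.0, message_fraction=0.05):
    """Return (message_delay, data_delay) in units of the round trip delay.

    io_com           -- value from the 'IO/Com' column (data processing
                        delay divided by round trip communication delay)
    round_trip       -- average round trip communication delay (fixed at 1)
    message_fraction -- message processing delay as a fraction of the
                        round trip delay (5% in the stated assumptions)
    """
    message_delay = message_fraction * round_trip
    data_delay = io_com * round_trip
    return message_delay, data_delay
```

With `io_com = 2.0`, the sketch yields a message processing delay of 0.05 and a data processing delay of 2.0, both measured in round trip times.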
•  Multiprogramming levels at the three sites are 10/11/11.

§  Multiprogramming levels at the three sites are 16/8/8.

#  Multiprogramming levels at the three sites are 24/4/4.

Assumptions:

Queueing for local processing is simulated.

Two kinds of local processing delay are simulated:
message processing delay and data processing delay.

The average round trip communication delay is fixed at 1.
The message processing delay is fixed at 5% of the
round trip communication delay.


Notation:

TZ  =  Average number of requests per transaction (transaction size).
DZ  =  Total number of data items in the database (database size).

MP  =  Multiprogramming level.

R/W  =  Percentage of transactions that are read-only.

IO/Com  =  Ratio of local data processing delay to communication
delay (excluding queueing).

Database Copies  =  Fraction of the database residing at each site.

Figure A.7  Average Response Time (Read/Write):
Short Transactions & IO Bound
•  Multiprogramming levels at the three sites are 10/11/11.


Assumptions:

Queueing for local processing is simulated.

Two kinds of local processing are simulated
(message and data processing).

The round trip communication delay is fixed at 1.
The local message processing delay is fixed at
5% of the round trip communication delay.
Ratio of local data processing & message processing delay is 10.
The ratio of local data processing delay to round trip
communication delay is shown in column 'IO/Com'.


Notation:

TZ  =  Average number of requests per transaction.

DZ  =  Total number of data items in the database.

MP  =  Multiprogramming level.

R/W  =  Ratio of read-only to write transactions.

IO/Com  =  Ratio of local data processing delay to
communication delay (excluding queueing).

Database Copies  =  Fraction of the database at each site.


Figure A.8  Through-Put (Read/Write): Long
Transaction Loaded & IO Bound
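
The legends describe the transaction mix in two ways: as R/(R+W), the percentage of transactions that are read-only, and as R/W, the ratio of read-only to write transactions. A minimal sketch of the conversion between the two follows; the function name is hypothetical and the relation is plain arithmetic rather than anything specified by the report.

```python
def read_only_percentage(rw_ratio):
    """Convert an R/W ratio into the percentage R/(R+W).

    With rho = R/W, R/(R+W) = rho / (1 + rho), expressed as a percentage.
    """
    if rw_ratio < 0:
        raise ValueError("R/W ratio must be non-negative")
    return 100.0 * rw_ratio / (1.0 + rw_ratio)
```

An R/W ratio of 1 (equal numbers of read-only and write transactions) corresponds to 50% read-only, and a ratio of 3 corresponds to 75%.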
•  Multiprogramming levels at the three sites are 10/11/11.


Assumptions:

Queueing for local processing is simulated.

Two kinds of local processing are simulated
(message and data processing).

The round trip communication delay is fixed at 1.
The local message processing delay is fixed at
5% of the round trip communication delay.
Ratio of local data processing & message processing delay is 10.
The ratio of local data processing delay to round trip
communication delay is shown in column 'IO/Com'.


Notation:

TZ  =  Average number of requests per transaction.

DZ  =  Total number of data items in the database.

MP  =  Multiprogramming level.

R/W  =  Ratio of read-only to write transactions.

IO/Com  =  Ratio of local data processing delay to
communication delay (excluding queueing).

Database Copies  =  Fraction of the database at each site.

Figure A.9  Average Response Time: Long
Transaction Loaded & IO Bound
•  Multiprogramming levels at the three sites are 10/11/11.


Assumptions:

Queueing for communication channel is simulated.

Only one kind of local processing is simulated.

The average round trip communication delay is fixed at 1.
The ratio of local data processing delay to round trip
communication delay is shown in column 'IO/Com'.


Notation:

TZ  =  Average number of requests per transaction.

DZ  =  Total number of data items in the database.

MP  =  Multiprogramming level.

R/W  =  Ratio of read-only to write transactions.

IO/Com  =  Ratio of local processing delay to communication
delay (excluding queueing delay).

Database Copies  =  Fraction of the database at each site.

Figure A.10  Through-Put (Read/Write): Long Transaction
Loaded & Communication Bound
•  Multiprogramming levels at the three sites are 10/11/11.

Assumptions:

Queueing for communication channel is simulated.

Only one kind of local processing is simulated.

The average round trip communication delay is fixed at 1.
The ratio of local data processing delay to round trip
communication delay is shown in column 'IO/Com'.


Notation:

TZ  =  Average number of requests per transaction.

DZ  =  Total number of data items in the database.

MP  =  Multiprogramming level.

R/W  =  Ratio of read-only to write transactions.

IO/Com  =  Ratio of local processing delay to communication
delay (excluding queueing delay).

Database Copies  =  Fraction of the database at each site.

Figure A.11  Average Response Time (Read/Write):
Long Transaction & Communication Bound